thejayman77 ec82764bef Initial commit: retroDE_ps2 — first-of-its-kind PS2 GS FPGA core (DE25-Nano / Agilex 5)
RTL (GS rasterizer, EE core stub, platform bridge, LPDDR4B path), sim regression
(272 TBs), docs, and tooling. Copyrighted PS2 content (BIOS, game code, GS dumps,
and all dump-derived textures/traces) is excluded via .gitignore and stays local.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-29 20:10:50 -04:00

# GIF/GS Contract
Status: `Draft`
## Purpose
Define the graphics ingress and rendering/display boundary.
## Owns
- GIF path intake and arbitration,
- GIF tag interpretation,
- GS register decode,
- GS VRAM-visible operations,
- framebuffer/zbuffer/texture-visible state handling,
- PCRTC/display output generation or a planned approximation layer.
## Inputs
- DMAC channel 2 traffic,
- VIF/VU-generated graphics traffic,
- privileged GS register writes,
- reset and display configuration controls.
## Outputs
- VRAM updates,
- display timing and pixel output,
- status/interrupt signals,
- packet and register trace events.
## Questions to lock
- What is the first output milestone:
  - GS privileged register acceptance only
  - static background color
  - minimal primitive draw
  - `gsKit`-style demo target
- Is Phase 1 display based on a faithful GS/PCRTC path or a temporary adapter?
- What VRAM organization assumptions must stay stable from the beginning?
## Allowed early stubs
- privileged-register-only GS stub,
- BGCOLOR/test-pattern display path,
- packet logger with no rendering.
## Required debug visibility
- GIF tags,
- PATH source and arbitration result,
- GS register writes,
- VRAM write summaries,
- display mode transitions.
## First meaningful milestone
- a known packet stream or direct privileged-register sequence produces a stable,
visible, repeatable output and matching trace.
## GS write-port contract (Ch75)
The GS model has **two architecturally distinct write ports** because real PS2
hardware exposes two unrelated register namespaces. Conflating them was a Ch74
mistake; Ch75 split them.
### `reg_wr_*` — privileged GS/MMIO writes
- Source: CPU MMIO writes to the `0x12000000` privileged-register block, e.g.
via `platform_video_stub` or a direct test-harness path.
- Address: `reg_wr_addr[15:0]` is the offset *inside* the privileged block.
- Examples: `BGCOLOR` at offset `0x00E0`, `PMODE` at `0x0000`,
`SMODE2` at `0x0020`, etc.
- Currently latched: `BGCOLOR` only. Other offsets emit `EV_MODE`.
### `gif_reg_*` — GIF A+D register-number writes
- Source: `gif_packed_stub` consuming a PACKED A+D entry when run with
`REAL_AD_REG_MAP=1` (the path used for real PS2 packets; the module
parameter itself still defaults to `0` for back-compat with the
project-local Ch72/Ch73 PACKED-A+D layout).
- Address: `gif_reg_num[7:0]` is the **GIF A+D register number** straight
out of the PACKED entry's `in_data[71:64]`. Source-of-truth is PCSX2
`pcsx2/GS/GSRegs.h`.
- Currently decoded: `PRIM=0x00`, `RGBAQ=0x01`, `XYZF2=0x04`, `XYZ2=0x05`,
`FRAME_1=0x4C`, `ZBUF_1=0x4E` (**not `0x4F` — that is `ZBUF_2`**).
Each has a dedicated 64-bit latch output. Other reg numbers emit `EV_MODE`.
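As a software-side illustration of the addressing above, the following Python sketch splits a 128-bit PACKED A+D qword into its register number and payload and classifies it against the decoded list. `GIF_AD_REGS`, `split_packed_ad`, and `classify` are hypothetical names for this example only; taking the payload from `in_data[63:0]` is an assumption based on the 64-bit latch width, not a quote of the RTL.

```python
# Register numbers copied from the "Currently decoded" list above.
GIF_AD_REGS = {
    0x00: "PRIM",
    0x01: "RGBAQ",
    0x04: "XYZF2",
    0x05: "XYZ2",
    0x4C: "FRAME_1",
    0x4E: "ZBUF_1",  # 0x4F would be ZBUF_2, which is not tracked
}


def split_packed_ad(qword: int) -> tuple[int, int]:
    """Return (reg_num, data) for a 128-bit PACKED A+D entry."""
    reg_num = (qword >> 64) & 0xFF            # in_data[71:64]
    data = qword & 0xFFFF_FFFF_FFFF_FFFF      # in_data[63:0] (assumed payload)
    return reg_num, data


def classify(qword: int) -> str:
    """Name of the tracked register, or 'EV_MODE' for any other reg number."""
    reg_num, _ = split_packed_ad(qword)
    return GIF_AD_REGS.get(reg_num, "EV_MODE")
```

A trace checker could use a table like this to predict which accepts should land in a dedicated latch and which should fall through to `EV_MODE`.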
### Event taxonomy
The two write paths emit different events. Read this carefully — `arg2`
semantics differ across emitters.
- `EV_BGCOLOR` — emitted **only** by `gs_stub` on the privileged port
when `reg_wr_addr == 0x00E0`. Carries the unpacked R/G/B in
`arg0`/`arg1`/`arg2`. The privileged port has no per-register
"selector" beyond this dedicated event; everything else on that port
goes to `EV_MODE` with `arg0=offset`, `arg1=data`.
- `EV_WRITE`: emitted in two places with different `arg2` semantics:
  - **`gif_packed_stub`** on a PACKED A+D accept (REGS nibble = `0xE`).
    Carries the raw PACKED address bits in `arg2` (`{48'd0,
    in_data[79:64]}`). Under `REAL_AD_REG_MAP=1` the low 8 bits are the
    real GIF reg# (`in_data[71:64]`); under `REAL_AD_REG_MAP=0` the low
    16 bits are the project-local privileged-style offset. **Not a
    stable selector: it is the address half of the wire.**
  - **`gs_stub`** on the `gif_reg_*` port for a tracked GIF reg
    (PRIM/RGBAQ/XYZF2/XYZ2/FRAME_1/ZBUF_1). Carries a **stable
    per-register selector** in `arg2`: `1=PRIM, 2=RGBAQ, 3=XYZF2,
    4=XYZ2, 5=FRAME_1, 6=ZBUF_1, 7=TEX0_1` (Ch98). `arg0=reg#`,
    `arg1=data`. Use this selector for trace-side filtering; it does
    not depend on `REAL_AD_REG_MAP`.
  - **Ch76 caveat**: a tracked vertex commit (XYZ2 or XYZF2) on the
    `gif_reg_*` port that *closes* a primitive does NOT emit `EV_WRITE`
    that cycle; `EV_PRIM_DRAW` preempts it (see below). The `xyz2_q` /
    `xyzf2_q` latch still updates. Trace consumers counting "vertices
    seen" must sum `EV_WRITE`(selector=3 or 4) + `EV_PRIM_DRAW` to get
    the true total.
- `EV_PRIM_DRAW`: Ch76 / Ch77. Fired by `gs_stub` once per primitive
  completion: when an XYZ2 or XYZF2 vertex commit on the `gif_reg_*`
  port closes a primitive under the current `PRIM[2:0]`. Preempts the
  `EV_WRITE` that the closing vertex would otherwise have emitted.
  Args: `arg0` = `PRIM[2:0]` (prim type), `arg1` = primary threshold,
  `arg2` = cumulative `prim_complete_count` (post-increment),
  `arg3` = closing vertex data (the same 64 bits that latched into
  `xyz2_q` / `xyzf2_q` on this cycle).
  - **Discrete primitives** (vertices per draw N: POINT=1, LINE=2,
    TRIANGLE=3, SPRITE=2): one draw per N vertices; the vertex counter
    resets to 0 after each draw.
  - **Strip / fan primitives** (primary threshold N: LINE_STRIP=2,
    TRI_STRIP=3, TRI_FAN=3): Ch77. Anchor on the first N vertices, then
    fire one draw per additional vertex commit. The vertex counter
    saturates at the primary threshold so every subsequent vertex
    closes another primitive. Ch78 adds **vertex-identity tracking**
    distinguishing TRI_STRIP rolling triangles `{v_n-2, v_n-1, v_n}`
    from TRI_FAN pivot triangles `{v_pivot, v_n-1, v_n}` (see the next
    section).
  - **Reserved** (PRIM=7): no draw, vertex commits do not increment
    the counter, latches still update.
  - A PRIM write always resets the vertex counter so a fresh
    primitive type starts cleanly.
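The counter rules above can be sketched as a small Python reference model that predicts how many `EV_PRIM_DRAW` events a run of vertex commits should produce. `THRESHOLD`, `STRIP_OR_FAN`, and `count_draws` are hypothetical names for this example, and the model is one reading of the rules above (counter reset for discrete types, saturation for strips and fans), not a transcription of `gs_stub`.

```python
# Vertices needed to close the first primitive, indexed by PRIM[2:0].
THRESHOLD = {0: 1, 1: 2, 2: 2, 3: 3, 4: 3, 5: 3, 6: 2}
STRIP_OR_FAN = {2, 4, 5}  # LINE_STRIP, TRI_STRIP, TRI_FAN


def count_draws(prim_type: int, num_vertices: int) -> int:
    """EV_PRIM_DRAW count for num_vertices commits following a PRIM write."""
    if prim_type not in THRESHOLD:       # PRIM=7 (reserved) never closes
        return 0
    n = THRESHOLD[prim_type]
    counter = 0                          # the PRIM write resets the counter
    draws = 0
    for _ in range(num_vertices):
        if prim_type in STRIP_OR_FAN:
            counter = min(counter + 1, n)    # saturate at the threshold
            if counter == n:
                draws += 1
        else:
            counter += 1
            if counter == n:
                draws += 1
                counter = 0                  # discrete types start over
    return draws
```

Because the closing commit's `EV_WRITE` is preempted, a checker would expect `num_vertices - count_draws(...)` vertex `EV_WRITE` events for the same run.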
### Per-primitive vertex snapshot (Ch78)
Alongside `EV_PRIM_DRAW`, `gs_stub` exposes three 64-bit outputs —
`prim_v0_q`, `prim_v1_q`, `prim_v2_q` — that hold the *vertex tuple*
of the most recently closed primitive. Snapshot is registered on the
same clock edge as the `ev_valid` pulse and held until the next
`prim_complete`, so a TB can sample it at the same time it sees
`gs_ev_event == EV_PRIM_DRAW`.
The number of valid slots is implicit in `PRIM[2:0]`:
| `PRIM` | type | valid slots | semantics |
|---|---|---|---|
| 0 | POINT | `v0` | the single vertex |
| 1 | LINE | `v0`, `v1` | endpoints |
| 2 | LINE_STRIP | `v0`, `v1` | each segment uses `{v_n-1, v_n}` |
| 3 | TRIANGLE | `v0`, `v1`, `v2` | the three vertices |
| 4 | TRI_STRIP | `v0`, `v1`, `v2` | rolling: `{v_n-2, v_n-1, v_n}` |
| 5 | TRI_FAN | `v0`, `v1`, `v2` | pivot+rolling: `{v_pivot, v_n-1, v_n}` |
| 6 | SPRITE | `v0`, `v1` | top-left + bottom-right |
| 7 | reserved | — | observer never closes |
The TRI_STRIP-vs-TRI_FAN distinction lives entirely in the
saturated-extension path: a TRI_STRIP advances `v0` each draw with
the rolling window; a TRI_FAN pins `v0` to `v_pivot` (the first
vertex committed since the most recent PRIM write). On the *anchor*
draw, `v_pivot` and the rolling `v_prev` happen to coincide, so
TRI_STRIP and TRI_FAN report the same tuple for their first
triangle.
A PRIM write clears the rolling window (`v_curr` / `v_prev` /
`v_prev_prev` / `v_pivot` / `pivot_seen`) so a fresh primitive
context starts with no residual vertex bleed. Slots not used by the
current primitive type read `0`.
The snapshot tracks identity, not geometry — the values written are
the raw 64-bit `gif_reg_data` payloads of XYZ2 / XYZF2 commits, with
no decoding into screen-space coordinates. Rasterization is still
out of scope.
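To make the TRI_STRIP versus TRI_FAN distinction concrete, here is a minimal Python sketch of the rolling window that yields the `(v0, v1, v2)` snapshot for each closed triangle. `closed_tuples` is a hypothetical helper for illustration; vertices are treated as opaque values, in line with the identity-only tracking described above.

```python
def closed_tuples(prim_type: int, vertices: list[int]) -> list[tuple[int, int, int]]:
    """(v0, v1, v2) snapshot per close for TRI_STRIP (4) or TRI_FAN (5)."""
    if prim_type not in (4, 5):
        raise ValueError("sketch covers TRI_STRIP and TRI_FAN only")
    out = []
    seen = 0
    pivot = prev_prev = prev = None
    for v in vertices:
        if seen == 0:
            pivot = v                    # first vertex since the PRIM write
        if seen >= 2:                    # third vertex onward closes a triangle
            v0 = prev_prev if prim_type == 4 else pivot
            out.append((v0, prev, v))
        prev_prev, prev = prev, v        # roll the window
        seen += 1
    return out
```

For the commit sequence `[10, 11, 12, 13]`, both types report `(10, 11, 12)` on the anchor draw; the strip then reports `(11, 12, 13)` while the fan reports `(10, 12, 13)`.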
### Per-primitive color snapshot (Ch79 / Ch80)
`prim_color_q[63:0]` is registered on the same edge as
`prim_v0_q` / `prim_v1_q` / `prim_v2_q` and carries the value of
`rgbaq_q` at the moment the primitive closed. RGBAQ writes are
separate A+D entries from XYZ2 / XYZF2 commits (gif_packed_stub
serializes A+D to one accept per cycle), so `rgbaq_q` is always
settled to its draw-time value when `prim_complete_now` fires.
`prim_color_q` reads `0` if no RGBAQ has been written since reset;
`rgbaq_q` itself is **not** cleared on a PRIM write — color carries
forward across PRIM context switches, matching real GS behavior —
but it does reset to `0` on `rst_n`.
#### Per-vertex Gouraud color (Ch80)
For real game streams that interleave RGBAQ writes with vertex
commits to drive Gouraud shading, `gs_stub` exposes three
additional outputs:
| Output | Slot semantics |
|---|---|
| `prim_color_v0_q[63:0]` | color of vertex 0 |
| `prim_color_v1_q[63:0]` | color of vertex 1 |
| `prim_color_v2_q[63:0]` | color of vertex 2 |
A parallel rolling color window (`c_curr_q` / `c_prev_q` /
`c_prev_prev_q` / `c_pivot_q`, internal) samples `rgbaq_q` on
every vertex commit, mirroring the Ch78 vertex-identity window.
The snapshot layout matches the vertex layout exactly:
| `PRIM` | type | `_v0_q` color of | `_v1_q` color of | `_v2_q` color of |
|---|---|---|---|---|
| 0 | POINT | the single vertex | 0 | 0 |
| 1 | LINE | first endpoint | closing | 0 |
| 2 | LINE_STRIP | previous vertex | closing | 0 |
| 3 | TRIANGLE | `v_n-2` | `v_n-1` | closing |
| 4 | TRI_STRIP | `v_n-2` (rolls) | `v_n-1` | closing |
| 5 | TRI_FAN, anchor | `v1` (≡ pivot) | `v2` | `v3` |
| 5 | TRI_FAN, saturated | `v_pivot` (PINNED) | `v_n-1` | closing |
| 6 | SPRITE | first endpoint | closing | 0 |
`prim_color_q` is exactly the closing-vertex color (≡
`prim_color_v_close`), kept as a convenience alias for consumers
that don't care about Gouraud.
For **flat-shaded** primitives (RGBAQ written once before the
strip), all per-vertex color slots used by the primitive equal
each other and equal `prim_color_q`. For **Gouraud-shaded**
primitives (RGBAQ rewritten between vertex commits), the slots
may differ — capturing the per-vertex color identity needed to
distinguish a strip's rolling colors from a fan's pivot color.
The color window is **cleared on PRIM write** (unlike `rgbaq_q`
itself, which carries forward). This means per-vertex color
identity stays tied to the current primitive context — a stream
that switches PRIM types mid-context starts color tracking fresh
for the new context. Slots not used by the current primitive type
read `0`.
Like the vertex snapshot, this captures identity, not interpolated
geometry — the stored values are the raw 64-bit RGBAQ payloads
(packing R, G, B, A, and the texture-coord divisor Q together);
GS-style Gouraud interpolation across the primitive interior
remains out of scope.
### Structured-field decode (Ch81)
`gs_stub` exposes pre-decoded snapshot outputs alongside the raw
64-bit slots so a downstream rasterizer or pixel-emit path doesn't
have to re-derive bit fields:
| Output | Type | Carries |
|---|---|---|
| `prim_v0_decoded_q` / `_v1_` / `_v2_` | `trace_pkg::vertex_t` | `x` / `y` / `z` / `fog` / `is_xyzf2` per slot |
| `prim_v0_color_decoded_q` / `_v1_` / `_v2_` | `trace_pkg::color_t` | `r` / `g` / `b` / `a` / `q` per slot |
The decoded outputs latch on the same edge as the raw snapshots, so
a TB samples both atomically with `EV_PRIM_DRAW`.
#### vertex_t and the XYZ2 / XYZF2 distinction
```sv
typedef struct packed {
  logic        is_xyzf2; // 1 = XYZF2 source, 0 = XYZ2
  logic [7:0]  fog;      // valid iff is_xyzf2; else 0
  logic [31:0] z;        // 32-bit (XYZ2) or zero-extended 24-bit (XYZF2)
  logic [15:0] y;        // 12.4 fixed-point screen Y
  logic [15:0] x;        // 12.4 fixed-point screen X
} vertex_t;
```
XYZ2 packs full 32-bit Z in `data[63:32]`. XYZF2 packs 24-bit Z in
`data[55:32]` and an 8-bit fog byte in `data[63:56]`. The `is_xyzf2`
flag is registered in a parallel rolling format-flag window
(`xyzf2_curr_q` / `xyzf2_prev_q` / `xyzf2_prev_prev_q` /
`xyzf2_pivot_q`) that tracks the source format of each vertex
through the rolling window — so when an XYZF2 vertex rolls into
the `v_prev` slot of a TRI_STRIP saturated extension, its
`is_xyzf2` flag rolls with it.
Cleared on `rst_n` and on PRIM write, same as the vertex/color
windows.
#### color_t
```sv
typedef struct packed {
  logic [31:0] q; // texture-coord divisor (IEEE float)
  logic [7:0]  a;
  logic [7:0]  b;
  logic [7:0]  g;
  logic [7:0]  r;
} color_t;
```
Direct bit-slice of the RGBAQ payload — no interpretation. Q is
carried verbatim as a 32-bit IEEE float (the GS uses it for
texture coordinate division during rasterization, which remains
out of scope).
#### Decode helper functions
`trace_pkg` exposes `decode_vertex(data, is_xyzf2)` and
`decode_color(data)` so downstream code can re-decode raw 64-bit
values consistently with the `gs_stub` snapshot.
The decoded outputs are an additive contract — the raw `prim_v*_q`
and `prim_color_v*_q` outputs continue to work for consumers that
don't care about per-channel decoding.
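For scoreboards written outside SystemVerilog, the same decode can be mirrored in a few lines of Python. This is a sketch under assumptions: the function names reuse `decode_vertex` / `decode_color` for readability but are not the `trace_pkg` functions, the dict return type is hypothetical, and the X/Y and R/G/B/A bit positions are inferred from the packed-struct field order and the Z/fog slices stated above.

```python
def decode_vertex(data: int, is_xyzf2: bool) -> dict:
    """Split a raw 64-bit XYZ2 / XYZF2 payload into vertex_t-style fields."""
    x = data & 0xFFFF                      # data[15:0], 12.4 fixed-point
    y = (data >> 16) & 0xFFFF              # data[31:16], 12.4 fixed-point
    if is_xyzf2:
        z = (data >> 32) & 0xFF_FFFF       # data[55:32], zero-extended
        fog = (data >> 56) & 0xFF          # data[63:56]
    else:
        z = (data >> 32) & 0xFFFF_FFFF     # data[63:32]
        fog = 0                            # fog is only valid for XYZF2
    return {"x": x, "y": y, "z": z, "fog": fog, "is_xyzf2": is_xyzf2}


def decode_color(data: int) -> dict:
    """Bit-slice a raw 64-bit RGBAQ payload; Q stays a raw 32-bit pattern."""
    return {
        "r": data & 0xFF,
        "g": (data >> 8) & 0xFF,
        "b": (data >> 16) & 0xFF,
        "a": (data >> 24) & 0xFF,
        "q": (data >> 32) & 0xFFFF_FFFF,
    }
```

The same raw payload decodes differently depending on `is_xyzf2`, which is why the format flag has to travel with each vertex through the rolling window.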
### Minimal pixel emit (Ch82)
`gs_stub` exposes a per-primitive *pixel emit* — the smallest
possible output that ties the recognition layer to a framebuffer
destination. One pixel per closed primitive (the closing vertex,
in screen-space integer coords), addressed by the latched
`frame_1_q` register. No interpolation, no coverage, no
rasterization — this is the contact point for a future raster
chapter, not a substitute for one.
| Output | Width | Carries |
|---|---|---|
| `pixel_emit` | 1 | 1-cycle strobe; pulses on the same edge as `prim_complete` |
| `pixel_emit_count` | 32 | Running tally of emits since reset |
| `pixel_x_q` / `pixel_y_q` | 12 | Closing vertex integer screen coords (top 12 bits of 12.4 fixed-point) |
| `pixel_color_q` | 64 | RGBAQ at the emit moment (= `prim_color_q`) |
| `pixel_fbp_q` | 9 | `FRAME_1[8:0]` (framebuffer base / 2048) |
| `pixel_fbw_q` | 6 | `FRAME_1[21:16]` (framebuffer width / 64 in pixels) |
| `pixel_psm_q` | 6 | `FRAME_1[29:24]` (pixel storage format) |
| `pixel_fb_addr_q` | 32 | Computed VRAM byte offset (see below) |
#### Address arithmetic
```
fb_addr = FBP * 2048 + (Y * FBW * 64 + X) * bytes_per_pixel
```
Ch83 added PSM-aware `bytes_per_pixel` derived from the latched
`FRAME_1[29:24]` (PSM field):
| PSM (hex) | Format | bytes/pixel | Notes |
|---|---|---|---|
| 00, 01 | PSMCT32 / PSMCT24 | 4 | host-word |
| 02, 0A | PSMCT16 / PSMCT16S | 2 | |
| 13 | PSMT8 | 1 | indexed |
| 14 | PSMT4 | 4 here (host-word) | **legacy `pixel_emit` channel only** — see note below |
| 1B, 24, 2C | PSMT8H / PSMT4HL / PSMT4HH | 4 | host-word (high/low nibble of 32-bit slot) |
| 30, 31 | PSMZ32 / PSMZ24 | 4 | depth |
| 32, 3A | PSMZ16 / PSMZ16S | 2 | depth |
| other | — | 4 (host-word fallback) | unrecognized PSM |
This table describes the **legacy `pixel_emit` channel** (the
single-pixel-per-primitive debug strobe from Ch82/Ch83). That
channel does not commit to `vram_stub`; it only pulses the strobe
and updates its snapshot outputs. Its PSMT4 entry stays at the
host-word fallback because the recognition layer never tracked
sub-byte position there.
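The address arithmetic and PSM table for the legacy channel can be sketched as one Python function. `BYTES_PER_PIXEL` and `legacy_fb_addr` are hypothetical names; the field slices and the formula are taken from the tables above, and `x` / `y` are assumed to be the already-truncated integer screen coordinates.

```python
# bytes/pixel per PSM on the legacy pixel_emit channel (table above).
BYTES_PER_PIXEL = {
    0x00: 4, 0x01: 4,             # PSMCT32 / PSMCT24
    0x02: 2, 0x0A: 2,             # PSMCT16 / PSMCT16S
    0x13: 1,                      # PSMT8
    0x14: 4,                      # PSMT4 (host-word on this channel)
    0x1B: 4, 0x24: 4, 0x2C: 4,    # PSMT8H / PSMT4HL / PSMT4HH
    0x30: 4, 0x31: 4,             # PSMZ32 / PSMZ24
    0x32: 2, 0x3A: 2,             # PSMZ16 / PSMZ16S
}


def legacy_fb_addr(frame_1: int, x: int, y: int) -> int:
    """fb_addr = FBP*2048 + (Y*FBW*64 + X) * bytes_per_pixel."""
    fbp = frame_1 & 0x1FF              # FRAME_1[8:0]
    fbw = (frame_1 >> 16) & 0x3F       # FRAME_1[21:16]
    psm = (frame_1 >> 24) & 0x3F       # FRAME_1[29:24]
    bpp = BYTES_PER_PIXEL.get(psm, 4)  # unrecognized PSM: host-word fallback
    return fbp * 2048 + (y * fbw * 64 + x) * bpp
```

With `FBP=1`, `FBW=10`, and PSMCT32, pixel `(3, 2)` lands at `2048 + (2*640 + 3)*4 = 7180`; switching the PSM field to PSMCT16 halves the pixel term and gives `4614`.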
The **raster channel (`raster_pixel_emit`)** does NOT use this
table. It owns its own PSM-aware emit packing in S2 with full
PSMT4 support after Ch106:
- Byte address = `pixel_index >> 1` (overrides the
`pixel_index << ras_bpp_shift` form).
- The 4-bit index from R[3:0] is placed in the targeted nibble
(low/high keyed by `pixel_index[0]`) of `write_data[7:0]`.
- `raster_pixel_be_q = 4'b0001`, `raster_pixel_mask_q = 0x0F`
or `0xF0` so `vram_stub`'s per-bit merge updates only that
nibble.
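The PSMT4 packing in the list above can be sketched in Python as follows. `psmt4_emit` is a hypothetical helper; it assumes `pixel_index[0] == 1` selects the high nibble and omits any framebuffer base offset, neither of which is spelled out in the list.

```python
def psmt4_emit(pixel_index: int, r: int) -> tuple[int, int, int, int]:
    """Return (byte_addr, write_data, write_be, write_mask) for one PSMT4 pixel."""
    index4 = r & 0xF                     # R[3:0] is the 4-bit palette index
    high = pixel_index & 1               # assumed: odd pixel -> high nibble
    byte_addr = pixel_index >> 1         # two pixels share each byte
    write_data = (index4 << 4) if high else index4
    write_mask = 0xF0 if high else 0x0F  # only the targeted nibble is merged
    return byte_addr, write_data, 0b0001, write_mask
```

Pixels 0 and 1 both resolve to byte 0 but carry disjoint masks, which is what lets the per-bit merge keep the other pixel's nibble intact.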
PSMT8H / PSMT4HL / PSMT4HH still address the host 32-bit slot,
not the high/low byte/nibble within it; the extracted sub-byte
is rasterizer/blit-specific and out of scope here.
`pixel_psm_q` is still exposed verbatim so consumers can apply
their own sub-slot offset arithmetic if needed.
#### Carry-forward semantics
`frame_1_q` is part of the standard GIF-context register file and
carries forward across PRIM writes (matching real GS). A stream
that sets `FRAME_1` once and then emits multiple primitives
correctly addresses all of them. A stream that never writes
`FRAME_1` lands every pixel at `fb_addr=0` — observable but not
useful, behaves cleanly under reset.
`rgbaq_q` likewise carries forward, so `pixel_color_q` reflects
the most recent RGBAQ write at emit time. If a Gouraud-style
stream rewrites RGBAQ between vertices, `pixel_color_q` captures
the closing-vertex color — same semantic as Ch79's
`prim_color_q`.
#### Strobe channel, not trace event
`pixel_emit` is a dedicated 1-cycle strobe alongside the snapshot
outputs, not a multiplexed event on the main `ev_valid` trace
stream. This avoids contention with `EV_PRIM_DRAW` on the close
cycle. A consumer that wants both can sample on `pixel_emit`
posedge and read the snapshots atomically.
### Minimal interior rasterizer (Ch84)
`gs_stub` adds a *separate* per-interior-pixel emit channel
alongside the per-primitive `pixel_emit` of Ch82. The Ch82
strobe is unchanged (still pulses once per closed primitive); the
new channel pulses once per pixel that the rasterizer determines
is inside the closed primitive's interior.
| Output | Width | Carries |
|---|---|---|
| `raster_pixel_emit` | 1 | 1-cycle strobe per emitted interior pixel |
| `raster_pixel_emit_count` | 32 | Cumulative interior pixels emitted since reset |
| `raster_pixel_x_q` / `_y_q` | 12 | Integer screen coords of the emitted pixel |
| `raster_pixel_color_q` | 64 | Per-pixel color: Gouraud-interpolated R/G/B/A for TRI/TRI_STRIP/TRI_FAN (Ch86), flat (= `prim_color_q`) for SPRITE. Q passes through from the closing vertex. |
| `raster_pixel_fb_addr_q` | 32 | Computed VRAM byte offset (PSM-aware, same math as Ch82/Ch83) |
| `raster_active` | 1 | High while the FSM is scanning a primitive |
| `raster_overflow` | 1 | Latches if a new primitive closes while the 2-entry raster FIFO is full and no concurrent pop frees a slot (Ch87 + audit-medium fix). See "Raster command queue (Ch87)" below for the back-to-back-close budget. |
| `raster_degenerate` | 1 | Latches if a TRI/STRIP/FAN closes with zero signed area (3 colinear vertices). SCAN is skipped; SPRITE never sets this. |
#### Per-primitive coverage
| `PRIM` | Raster behavior |
|---|---|
| 0 POINT | No raster emit — Ch82 closing-pixel only |
| 1 LINE | No raster emit — Ch82 closing-pixel only |
| 2 LINE_STRIP | No raster emit — Ch82 closing-pixel only |
| 3 TRIANGLE | Bounding-box scan with edge-function half-plane test |
| 4 TRI_STRIP | Same engine as TRIANGLE, fires per closed strip triangle |
| 5 TRI_FAN | Same engine as TRIANGLE, fires per closed fan triangle |
| 6 SPRITE | Bounding-box rectangle fill (every pixel inside) |
| 7 reserved | No raster emit |
#### Triangle edge-function math
For each candidate pixel `p` and each edge `(vA, vB)` of the
triangle:
```
e(p) = (p.x - vA.x) * (vB.y - vA.y) - (p.y - vA.y) * (vB.x - vA.x)
```
32-bit signed math is used to avoid overflow at typical coord
ranges.
##### Top-left fill rule (Ch85)
Adjacent triangles that share an edge would double-paint pixels
on that edge under a naïve same-sign test. Ch85 applies the
standard D3D-style top-left fill rule so each shared-edge pixel
is owned by exactly one of the two triangles.
At the IDLE→SCAN transition the FSM:
1. Computes `signed_area = (v1-v0) × (v2-v0)`.
2. If `signed_area == 0` → degenerate (3 colinear vertices);
`raster_degenerate` latches and SCAN is skipped (no
raster pixels emit). The Ch82 `pixel_emit` and `prim_complete`
pulses still fire — only the interior raster is suppressed.
3. If `signed_area < 0` → CW winding; the FSM swaps `v1` and
`v2` so the rule applies uniformly to a CCW-ordered triangle.
4. For each edge of the post-swap CCW triangle, classifies it as
*top-or-left* (inclusive) or *right/bottom* (exclusive):
- **Top edge**: horizontal going right (`dy == 0 && dx > 0`).
- **Left edge**: going down in Y-down screen (`dy > 0`).
- Anything else is a right or bottom edge.
The inside test in SCAN becomes:
```
inside = (e[i] + bias[i] <= 0) for all i in {0, 1, 2}
```
where `bias[i] = 0` if edge `i` is top-or-left and `bias[i] = 1`
otherwise. The `+1` bias converts the strict `< 0` test for
right/bottom edges into a non-strict `<= 0` test on the biased
value, keeping the math integer and uniform.
Result: for any two adjacent triangles sharing an edge, the
edge's pixels are inclusive in exactly one triangle's bias
configuration and exclusive in the other's — no double-paint.
Some shared-corner pixels may end up unpainted by either
triangle. That's the standard top-left rule trade-off:
non-overlap takes priority over coverage of every boundary
pixel.
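The fill rule can be exercised with a small Python reference model that enumerates the integer pixels a triangle owns. `edge`, `bias`, and `covered` are hypothetical names; taking `dx` / `dy` as `vB - vA` for each directed edge is an assumption of this sketch, and it is meant to check the no-double-paint property rather than to reproduce the RTL bit for bit.

```python
def edge(p, a, b):
    """Edge function e(p) for the directed edge a -> b (integer coords)."""
    return (p[0] - a[0]) * (b[1] - a[1]) - (p[1] - a[1]) * (b[0] - a[0])


def bias(a, b):
    """0 for a top-or-left (inclusive) edge, 1 for a right/bottom edge."""
    dx, dy = b[0] - a[0], b[1] - a[1]    # assumed convention: vB - vA
    is_top = dy == 0 and dx > 0
    is_left = dy > 0
    return 0 if (is_top or is_left) else 1


def covered(v0, v1, v2):
    """Set of (x, y) pixels the triangle owns under the biased inside test."""
    area = (v1[0] - v0[0]) * (v2[1] - v0[1]) - (v1[1] - v0[1]) * (v2[0] - v0[0])
    if area == 0:
        return set()                     # degenerate: SCAN is skipped
    if area < 0:
        v1, v2 = v2, v1                  # normalize winding before testing
    edges = [(v0, v1), (v1, v2), (v2, v0)]
    xs = [v0[0], v1[0], v2[0]]
    ys = [v0[1], v1[1], v2[1]]
    return {
        (x, y)
        for y in range(min(ys), max(ys) + 1)
        for x in range(min(xs), max(xs) + 1)
        if all(edge((x, y), a, b) + bias(a, b) <= 0 for a, b in edges)
    }
```

Splitting the quad `(0,0)-(4,0)-(4,4)-(0,4)` along its diagonal into `covered((0,0), (4,0), (0,4))` and `covered((4,0), (4,4), (0,4))` gives two disjoint pixel sets: the shared-diagonal pixel `(2, 2)` belongs to the first triangle only, because the two triangles traverse that edge in opposite directions and so classify it oppositely.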
##### Per-pixel Gouraud color (Ch86)
Triangle interior pixels now use **per-pixel Gouraud color
interpolation** instead of flat shading. The three per-vertex
colors (the same Ch80 `prim_color_v0_q` / `prim_color_v1_q` /
`prim_color_v2_q` slot mapping) are latched at SCAN init with
the same `v1↔v2` swap mirror as the vertex coords, so the
post-swap CCW vertex order matches the latched color order.
For each interior pixel `p`, barycentric weights are derived
directly from the unbiased edge functions:
```
L0(p) = -e1(p) // weight for v0 = signed area of (p, v1, v2)
L1(p) = -e2(p) // weight for v1
L2(p) = -e0(p) // weight for v2
— L0 + L1 + L2 == sa for all p inside the triangle
```
For each color channel `ch` ∈ {R, G, B, A}:
```
ch_out(p) = (L0(p)*c0.ch + L1(p)*c1.ch + L2(p)*c2.ch) / sa
```
Q (the texture-coord IEEE float in c2's upper 32 bits) is **not**
interpolated — it passes through from the closing vertex's RGBAQ
unchanged.
For a flat-shaded primitive (RGBAQ written once before all three
vertices, all three vertex colors equal), the weights sum to `sa`
(`L0 + L1 + L2 == sa`, so the normalized weights satisfy
`λ0+λ1+λ2 = 1`) and the formula collapses to `c0` exactly with no
rounding error, so existing flat-shaded raster TBs (raster_basic,
raster_topleft) continue to pass.
The R/G/B/A division uses **integer truncation toward zero**.
Real PS2 GS uses fixed-point with specific rounding rules; the
recognition-layer stub is intentionally simpler. SPRITE keeps
flat shading (only 2 vertices, no barycentric weights defined).
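A one-channel Python sketch of the interpolation above, useful for predicting expected pixel values in a testbench. `edge`, `trunc_div`, and `gouraud` are hypothetical names; the vertices are assumed to be in the post-swap order with positive signed area, and the caller is assumed to have filtered degenerate triangles (`sa == 0`).

```python
def edge(p, a, b):
    """Unbiased edge function e(p) for the directed edge a -> b."""
    return (p[0] - a[0]) * (b[1] - a[1]) - (p[1] - a[1]) * (b[0] - a[0])


def trunc_div(n: int, d: int) -> int:
    """Integer division truncating toward zero (Python's // floors instead)."""
    q = abs(n) // abs(d)
    return q if (n >= 0) == (d >= 0) else -q


def gouraud(p, v, c) -> int:
    """Interpolate one channel at pixel p.

    v: three (x, y) vertices in post-swap order; c: their channel values.
    """
    e0 = edge(p, v[0], v[1])
    e1 = edge(p, v[1], v[2])
    e2 = edge(p, v[2], v[0])
    l0, l1, l2 = -e1, -e2, -e0           # weights for v0, v1, v2
    sa = l0 + l1 + l2                    # equals the triangle's signed area
    return trunc_div(l0 * c[0] + l1 * c[1] + l2 * c[2], sa)
```

For the triangle `(0,0), (4,0), (0,4)` and pixel `(1, 1)` the weights are `8, 4, 4` with `sa = 16`, so channel values `(255, 0, 0)` give `2040 / 16 = 127.5`, truncated to `127`; equal channel values at all three vertices come back unchanged.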
#### Sprite rectangle fill
A SPRITE has two vertices forming opposite corners. The bounding
box is computed via `min`/`max` of each axis; every pixel inside
the box is emitted in row-major order.
#### FSM and scan timing
The FSM has two states, `IDLE` and `SCAN`. On `prim_complete_now` for an eligible
primitive, the FSM latches the vertex tuple, color, FRAME_1
fields, and bounding box, then walks the box one pixel per cycle.
For each pixel: combinational inside-test → if inside, pulse
`raster_pixel_emit` and update the snapshot. Returns to `IDLE`
when `(ras_cur_x, ras_cur_y) == (x_max, y_max)`.
Color is **Gouraud-interpolated per pixel** for triangles
(Ch86) and **flat** for sprites; see the dedicated subsections
above for the fill-rule and Gouraud math. The closing-primitive
flat color (`prim_color_q`) is still used as the SPRITE fill
color and as a backward-compat reference for flat-shaded TRIs
(when all three vertex colors are equal, the Gouraud formula
reduces to that flat value with no rounding error).
Coordinates are **integer** — the 4-bit sub-pixel of 12.4
fixed-point is discarded. Sub-pixel edge adjustment is not
modeled (top-left fill rule IS modeled — see Ch85 subsection
above).
#### Raster command queue (Ch87) and `raster_overflow`
`gs_stub` has a **2-entry FIFO** in front of the SCAN FSM. Every
primitive close that targets the rasterizer (`RM_TRI` /
`RM_SPRITE`) snapshots its full per-prim context (vertices,
bias, signed area, per-vertex colors, FRAME_1 fields, bounding
box) into the queue at the close cycle. The FSM dequeues the
oldest entry whenever it's idle or finishing a scan. Effective
concurrency is **1 in-flight + 2 queued = up to 3 back-to-back
closes** absorbed without drop.
`raster_overflow` now latches when a 4th close arrives while the
FIFO is **full** (1 in-flight, both FIFO slots occupied). The
4th primitive is dropped. Earlier chapters' bound of "1 close
mid-scan = overflow" is replaced by Ch87's "3 closes
back-to-back = OK; 4th = overflow."
Degenerate triangles are **filtered at enqueue**: they set
`raster_degenerate` and are not pushed into the queue. SPRITE
never sets `raster_degenerate`. POINT/LINE/LINE_STRIP don't
raster (RM_NONE) — they don't enqueue at all and the queue
ignores them.
Pop happens at `IDLE`→`SCAN` AND at drain-done (Ch88; see below)
when the queue has more work, so back-to-back scans run
contiguously without an `IDLE` bubble. `raster_active` stays
high across the boundary.
Real PS2 game streams emit thousands of primitives back-to-back;
3-deep concurrency is enough for most TRI_STRIP / TRI_FAN
patterns with small bounding boxes. Larger sprites or larger
triangles increase scan length and reduce headroom — a future
chapter can grow the FIFO depth.
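The back-to-back-close budget can be illustrated with a cycle-free Python model. `RasterQueue` and its methods are hypothetical; the sketch ignores cycle timing and the concurrent push/pop case, and it assumes a close that arrives while the FSM is idle starts scanning immediately.

```python
from collections import deque


class RasterQueue:
    """1 in-flight scan plus a 2-entry FIFO, with a sticky overflow flag."""

    DEPTH = 2

    def __init__(self):
        self.fifo = deque()
        self.in_flight = None
        self.overflow = False            # latches like raster_overflow

    def close(self, prim) -> None:
        """A raster-eligible (RM_TRI / RM_SPRITE) primitive closes."""
        if self.in_flight is None and not self.fifo:
            self.in_flight = prim        # FSM idle: scan starts at once
        elif len(self.fifo) < self.DEPTH:
            self.fifo.append(prim)       # absorbed by the queue
        else:
            self.overflow = True         # FIFO full: primitive is dropped

    def scan_done(self) -> None:
        """Current scan finishes; dequeue the oldest entry if there is one."""
        self.in_flight = self.fifo.popleft() if self.fifo else None
```

Closing primitives `A`, `B`, `C` back to back leaves `A` in flight with `B` and `C` queued; a fourth close before any `scan_done()` sets `overflow` and the fourth primitive never scans.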
#### Pixel pipeline (Ch88)
The SCAN body is **3 stages, throughput 1 candidate pixel per
cycle**:
| Stage | Source | Work |
|-------|--------|------|
| **S0** | `ras_cur_x` / `ras_cur_y` (bbox walker) | Generate the next candidate coord; advance the bbox walker; on bbox corner, fire `ras_at_end_of_s0` and transition R_SCAN→R_DRAIN. |
| **S1** | `s1_x_q` / `s1_y_q` (registered) | Combinational edge functions on `(s1_x, s1_y)` against the three triangle edges (or trivial-true for SPRITE), top-left bias, inside test → `s1_pixel_inside`. Latched into `s2_inside_q`. |
| **S2** | `s2_x_q` / `s2_y_q` / `s2_L0..L2_q` / `s2_inside_q` | Compute Gouraud `interp_byte(λ_i, c_i)` × 4 channels and `s2_fb_addr` from PSM/FBP/FBW. If `s2_valid_q && s2_inside_q`, drive `raster_pixel_emit` with the resolved fb_addr / x / y / color. |
`raster_state` is now a 3-state FSM:
- **R_IDLE** — no work; `pop_ok` fires on a non-empty FIFO.
- **R_SCAN** — S0 produces one valid coord per cycle; S1/S2
latches propagate. On bbox corner, transitions to R_DRAIN.
- **R_DRAIN** — S0 stops producing valids (`s1_valid_q <= 0`);
S1 and S2 finish their in-flight pixels. When both pipeline
valids are low (`drain_done`), the FSM either pops the next
primitive (back-to-back contiguous SCAN) or returns to R_IDLE.
`pop_ok = !fifo_empty && (R_IDLE || drain_done)` — the
end-of-scan pop is now drain-done, three cycles after S0
produces the corner. This guarantees the pipeline-tail pixels
of the previous primitive are not overwritten by the next
primitive's pop, while still keeping `raster_active` high
across the seam.
Latency from `pop_ok` to first registered `raster_pixel_emit`
is **3 stages of pipeline + 1 cycle of FIFO turnaround + 1
cycle of registered emit output = 5 posedges from the close
cycle of the closing vertex** (see
`sim/tb/gif_gs/tb_gs_raster_pipeline.sv` for the cycle-exact
contract).
Two further entries complete the event taxonomy above:
- `EV_MODE`: fired for any accept that did not resolve to a tracked
register: REGLIST entries, IMAGE/DISABLE payload qwords, NOP-nibble
PACKED slots, unknown privileged offsets, unknown GIF reg numbers.
Reserved for "we know we saw something, we are intentionally not
modeling it yet."
- `EV_GIFTAG`: one per accepted GIFtag; carries `flg`/`nreg`/`nloop`/`eop`
for stream-level checking.
When trace event semantics change, audit this section and the per-stub
trace-schema header comments together.
#### VRAM persistence (Ch89)
`vram_stub` (`rtl/gif_gs/vram_stub.sv`) is the **first persistence
layer** the rasterizer has had. Every `raster_pixel_emit` pulse
writes up to 4 bytes of pixel data (gated per byte by `write_be`)
at `raster_pixel_fb_addr_q` into `vram_stub`'s linear byte array.
A combinational debug read port
exposes `read_data` byte-addressably so testbenches can verify
storage.
Wiring:
| vram_stub port | gs_stub source |
|---|---|
| `write_en` | `raster_pixel_emit` |
| `write_addr` | `raster_pixel_fb_addr_q` |
| `write_data` | `raster_pixel_color_q[31:0]` (the lower 32 bits — Q in the upper 32 is not framebuffer data) |
| `write_be` | `raster_pixel_be_q` (Ch95) — per-byte write enable: byte i (the byte at `write_addr + i`) is committed only when `write_en && write_be[i]`. Lets the same 32-bit write port serve PSMs of any byte width. |
| `write_mask` | `raster_pixel_mask_q` (Ch106) — per-bit merge mask: for each enabled byte, `mem[i] <= (mem[i] & ~mask_i) | (data_i & mask_i)`. Tied to `0xFFFFFFFF` for PSMs ≥ 1 byte/pixel (no behavior change). PSMT4 drives `0x0000_000F` or `0x0000_00F0` to preserve the un-targeted nibble in the same byte. |
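The `write_be` / `write_mask` behavior in the table can be modeled in Python on a `bytearray`. `vram_write` is a hypothetical helper that applies one write-port beat using the merge expression quoted above; it is a behavioral sketch, not the `vram_stub` RTL.

```python
def vram_write(mem: bytearray, addr: int, data: int, be: int,
               mask: int = 0xFFFF_FFFF) -> None:
    """Apply one write beat: byte i lands at addr + i when be[i] is set."""
    for i in range(4):
        if (be >> i) & 1:
            d = (data >> (8 * i)) & 0xFF     # data lane i
            m = (mask >> (8 * i)) & 0xFF     # per-bit merge mask for lane i
            mem[addr + i] = (mem[addr + i] & ~m & 0xFF) | (d & m)
```

A halfword write with `be = 0b0011` at an odd address touches exactly two bytes, and two PSMT4-style writes to the same byte with masks `0x0F` and `0xF0` leave both nibbles in place:

```python
mem = bytearray(8)
vram_write(mem, 1, 0xAABBCCDD, 0b0011)        # mem[1]=0xDD, mem[2]=0xCC
vram_write(mem, 0, 0x0B, 0b0001, 0x0000_000F) # low nibble of byte 0
vram_write(mem, 0, 0xC0, 0b0001, 0x0000_00F0) # high nibble of byte 0
```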
Scope (current write-side support, after Ch106):
- **PSMCT32 + PSMCT16 + PSMT8 + PSMT4** at the raster write port. The
  PSM width is selected by `gs_stub`'s `bpp_shift` mux off
  `FRAME_1.PSM` and surfaced as `raster_pixel_psm_q`; `gs_stub`'s
  S2 packs the pixel into the right byte lane and drives
  `raster_pixel_be_q` so `vram_stub` commits exactly the right
  bytes:
  - PSMCT32 (PSM=0x00) → 4 bytes/pixel, `be = 4'b1111`, ABGR in
    `write_data[31:0]`.
  - PSMCT16 (PSM=0x02) → 2 bytes/pixel, `be = 4'b0011`, RGB5A1
    packed in `write_data[15:0]` (Ch95). `write_addr` is the
    halfword byte address; per-byte `be` makes unaligned
    halfword writes safe.
  - PSMT8 (PSM=0x13) → 1 byte/pixel, `be = 4'b0001`, the natural
    ABGR's R channel goes into `write_data[7:0]` as the PSMT8
    index (Ch105). `write_addr` is the exact byte address;
    `vram_stub` commits `mem[write_addr] ← write_data[7:0]` at
    any byte alignment without needing data-lane shifting.
  - PSMT4 (PSM=0x14) → 0.5 bytes/pixel (2 pixels per byte),
    `be = 4'b0001`, `write_mask = 0x0000_000F` (low nibble) or
    `0x0000_00F0` (high nibble) per `pixel_index[0]`. The 4-bit
    index (low nibble of natural ABGR's R) is placed in the
    targeted nibble position in `write_data[7:0]`. `vram_stub`
    merges only the masked bits, so the OTHER nibble of the same
    byte is preserved (Ch106). Back-to-back same-byte emits
    (e.g. PSMT4 pixels x=0 and x=1, both landing in byte 0)
    chain through NBA semantics: the second NBA samples
    `mem[addr]` AFTER the prior commit, so both nibbles end up in
    the byte without a bypass-forwarding net.
  - PSMCT24 / PSMCT16S / PSMZ32 / PSMZ24 / PSMZ16 / PSMZ16S /
    PSMT8H / PSMT4HL / PSMT4HH: `bpp_shift` falls through to a
    host-word default (4 bytes); raster emit through these PSMs
    is not contract-tested.
- **Write-side addressing**. Real PS2 VRAM is 4 MiB organized
into pages × blocks × columns per PSM. By DEFAULT, both
`gs_stub` raster emit and `gif_image_xfer_stub` TRXDIR uploads
produce the linear-framebuffer layout PCSX2 calls "linear PSM".
Optional per-PSM swizzle paths gated by parameters on each
module:
* **PSMCT32**: `PSMCT32_SWIZZLE` parameter on `gs_pcrtc_stub`
(Ch120 read-side), `gif_image_xfer_stub` (Ch121 image-xfer
write-side), and `gs_stub` (Ch122 raster write-side).
* **PSMCT16**: `PSMCT16_SWIZZLE` parameter on `gs_pcrtc_stub`
(Ch126 read-side), `gif_image_xfer_stub` (Ch127 image-xfer
write-side), and `gs_stub` (Ch128 raster write-side). All
three integration points live, mirroring the PSMCT32 trio.
When on, byte addresses route through the per-PSM swizzle module
(`gs_swizzle_psmct32_stub` / `gs_swizzle_psmct16_stub`); image-xfer
adds `dest_base_q = DBP*256` on top of the swizzle output so any
DBP works, while raster emit feeds the active `ras_fbp` directly
so the swizzle output is already the absolute address. Per-PSM
parameters are independent — enabling one doesn't affect the
other PSM. **PSMT8** has its full three-path swizzle integration
as of Ch134, mirroring the PSMCT32/PSMCT16 trios: standalone
math primitive `gs_swizzle_psmt8_stub` (Ch131) wired into
`gs_pcrtc_stub` (Ch132 read-side, `PSMT8_SWIZZLE`),
`gif_image_xfer_stub` (Ch133 write-side), and `gs_stub` (Ch134
raster emit) — same parameter name on all three modules.
**PSMT4** has its full three-path swizzle integration as of
Ch140, mirroring the PSMCT32/PSMCT16/PSMT8 trios: standalone
math primitive `gs_swizzle_psmt4_stub` (Ch137) wired into
`gs_pcrtc_stub` (Ch138 read-side, `PSMT4_SWIZZLE`),
`gif_image_xfer_stub` (Ch139 write-side), and `gs_stub` (Ch140
raster emit) — same parameter name on all three modules. The
PSMT4 paths additionally thread the swizzle module's
`nibble_hi` output through the existing Ch106 (raster) /
Ch118 (image-xfer) nibble RMW machinery (replacing
`s2_pixel_index[0]` / `x_eff[0]` as the high/low nibble
selector when the gate is on). All parameter defaults are 0,
so existing TBs see the legacy linear behavior. **All four
common GS PSMs (CT32 + CT16 + T8 + T4) now have complete
three-path swizzle integration (read, image-xfer write, raster write).**
- **Stub-sized**. Default `BYTES = 65536`. Real VRAM is 4 MiB; for
TB purposes a small linear region is enough to verify that
emitted pixels actually land at the addresses gs_stub computes.
- **Scanout path** is provided by `gs_pcrtc_stub` (Ch90 — see
below). The legacy `platform_video_stub` flood-fills BGCOLOR
and is unaware of VRAM; TBs that want to verify the round trip
use `gs_pcrtc_stub` instead.
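
As a rough illustration of the default linear layout and of how the two write paths combine a base pointer with a swizzle result, the following Python sketch may help. It is not taken from the RTL: the function names are hypothetical, the `swizzle` argument stands in for whichever per-PSM module is gated on, and the `DBW × 64` row width on the image-xfer side is an assumption. The sample values are the FBP/DBP and coordinates quoted for the PSMCT32 linear-vs-swizzled separators elsewhere in this contract.

```python
from typing import Callable

# Hypothetical swizzle signature: (base_ptr, width_units, x, y) -> byte address.
Swizzle = Callable[[int, int, int, int], int]


def linear_byte_addr(base_bytes: int, width_units: int, x: int, y: int,
                     bpp_shift: int) -> int:
    """Legacy linear layout (PSMCT32/PSMCT16/PSMT8; PSMT4 packs 2 pixels/byte)."""
    pixels_per_row = width_units << 6           # width is in units of 64 pixels
    pixel_index = y * pixels_per_row + x
    return base_bytes + (pixel_index << bpp_shift)


def raster_linear_addr(fbp: int, fbw: int, x: int, y: int, bpp_shift: int = 2) -> int:
    return linear_byte_addr(fbp << 11, fbw, x, y, bpp_shift)      # FBP * 2048


def image_xfer_linear_addr(dbp: int, dbw: int, x: int, y: int, bpp_shift: int = 2) -> int:
    # Assumption: the upload uses the same row math with DBW in place of FBW.
    return linear_byte_addr(dbp << 8, dbw, x, y, bpp_shift)       # DBP * 256


def raster_swizzled_addr(swizzle: Swizzle, fbp: int, fbw: int, x: int, y: int) -> int:
    # Raster feeds ras_fbp into the swizzle, so the result is already absolute.
    return swizzle(fbp, fbw, x, y)


def image_xfer_swizzled_addr(swizzle: Swizzle, dbp: int, dbw: int, x: int, y: int) -> int:
    # Image-xfer calls the swizzle with FBP=0 and adds DBP*256 on top.
    return (dbp << 8) + swizzle(0, dbw, x, y)


# Stand-in "swizzle" that is really the linear formula, used only to show
# that both composition rules anchor at the same absolute base.
def stand_in(fbp: int, fbw: int, x: int, y: int) -> int:
    return linear_byte_addr(fbp << 11, fbw, x, y, 2)


print(raster_linear_addr(4, 2, 60, 4))                  # 10480
print(image_xfer_linear_addr(8, 1, 4, 4))               # 3088
print(image_xfer_swizzled_addr(stand_in, 8, 1, 4, 4))   # 3088
```

Passing the swizzle as a callable keeps the sketch independent of any particular per-PSM table; a TB-side `ref_addr()` mirror could be dropped in without changing the composition rules.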
The Ch89 white-box TB `tb_gs_vram_writeback.sv` exercises the
contract end-to-end: drive a 4×4 SPRITE through gs_stub, capture
the (fb_addr, color) of each `raster_pixel_emit` pulse, then
read each fb_addr back from `vram_stub` and assert byte-exact
match.
#### PCRTC scanout (Ch90)
`gs_pcrtc_stub` (`rtl/gif_gs/gs_pcrtc_stub.sv`) is the **scanout
side** of the GS pipeline — its dual is `gs_stub` (the write
side). It models a minimal PCRTC (Programmable CRT Controller):
runs its own raster timing, generates a VRAM read address from
the current `(hcnt, vcnt)` using the same fb_addr math as
gs_stub, reads the byte returned by `vram_stub`'s combinational
debug port, and drives `r`/`g`/`b` for the active area. Together
with Ch88's pipeline + Ch89's VRAM, this closes the loop:
```
raster_pixel_emit → vram_stub.write → vram_stub.read → pcrtc.r/g/b
```
Configuration (Ch91 — privileged-block CPU MMIO):
`gs_pcrtc_stub` consumes three real PS2 GS privileged display
register latches directly from `gs_stub`:
| pcrtc input | gs_stub source | Layout |
|---|---|---|
| `pmode_q[63:0]` | privileged write at offset 0x0000 | bit 0 = EN1 (display 1 enable) |
| `dispfb1_q[63:0]` | privileged write at offset 0x0070 | FBP[8:0], FBW[14:9], PSM[19:15], DBX[42:32] (Ch91-audit), DBY[53:43] (Ch91-audit) |
| `display1_q[63:0]` (Ch92, Ch93) | privileged write at offset 0x0080 | DX[11:0], DY[22:12], MAGH[26:23] (Ch93 — H scale = MAGH+1), MAGV[28:27] (Ch93 — V scale = MAGV+1), DW[43:32] (width-1), DH[54:44] (height-1) |
The Ch90 sideband ports (`scanout_enable` / `dispfb_fbp` /
`dispfb_fbw`) are **gone**. TBs program scanout the way a real
PS2 driver would: write DISPFB1, then DISPLAY1, then PMODE.EN1=1
(Ch92). Out of reset, all three registers are 0, so EN1 is low
and pcrtc outputs 0.
`scanout_enable` inside pcrtc is derived combinationally from
the latches:
`scanout_enable = pmode_q[0] & (PSM ∈ {0, 2, 0x13, 0x14})`.
PSMCT32 (=0), PSMCT16 (=2), PSMT8 (=0x13), and PSMT4 (=0x14) are
honored at this scope; any other PSM forces scanout off rather
than mis-decoding the byte layout.
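
The field layout in the table and the `scanout_enable` derivation can be mirrored in a small software model, which may be handy for building TB stimulus. This is a sketch under the bit ranges listed above; the helper names (`bits`, `decode_dispfb1`, `decode_display1`, `scanout_enable`) are hypothetical and do not correspond to RTL signals.

```python
SUPPORTED_PSMS = {0x00, 0x02, 0x13, 0x14}   # PSMCT32, PSMCT16, PSMT8, PSMT4


def bits(value: int, hi: int, lo: int) -> int:
    """Extract value[hi:lo], Verilog-style inclusive bit range."""
    return (value >> lo) & ((1 << (hi - lo + 1)) - 1)


def decode_dispfb1(reg: int) -> dict:
    return {
        "FBP": bits(reg, 8, 0),
        "FBW": bits(reg, 14, 9),
        "PSM": bits(reg, 19, 15),
        "DBX": bits(reg, 42, 32),
        "DBY": bits(reg, 53, 43),
    }


def decode_display1(reg: int) -> dict:
    return {
        "DX": bits(reg, 11, 0),
        "DY": bits(reg, 22, 12),
        "MAGH": bits(reg, 26, 23),   # H scale = MAGH + 1
        "MAGV": bits(reg, 28, 27),   # V scale = MAGV + 1
        "DW": bits(reg, 43, 32),     # width - 1
        "DH": bits(reg, 54, 44),     # height - 1
    }


def scanout_enable(pmode: int, dispfb1: int) -> bool:
    """PMODE.EN1 gated by a PSM that pcrtc knows how to decode."""
    return bool(pmode & 1) and decode_dispfb1(dispfb1)["PSM"] in SUPPORTED_PSMS


# Example: FBP=0, FBW=1, PSM=PSMT4, DBX=4, DBY=2.
dispfb1 = (1 << 9) | (0x14 << 15) | (4 << 32) | (2 << 43)
print(decode_dispfb1(dispfb1))
print(scanout_enable(1, dispfb1))   # True
print(scanout_enable(0, dispfb1))   # False (EN1 low, the out-of-reset state)
```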
DISPLAY1 (Ch92, Ch93) supplies the **display window** — the
sub-rect inside the active area where pcrtc actually pulls
pixels from VRAM — and the **per-axis magnification**: each
VRAM column is shown for (MAGH+1) consecutive VCK pulses, each
VRAM line for (MAGV+1) raster lines. Outside the window pcrtc
drives r/g/b = 0 even with EN1=1. Pcrtc's H_TOTAL/V_TOTAL still
come from module parameters at instantiation; only the
active-area sub-rect gated by DX/DY/DW/DH is register-driven.
Dual-display (PMODE.EN2 + DISPFB2 + DISPLAY2) is deferred.
Address math + display-window gating + magnification:
```
hmag_factor = MAGH + 1 // 1..16
vmag_factor = MAGV + 1 // 1..4
hwin_rel = hcnt - DX // pixel offset inside the window
vwin_rel = vcnt - DY
in_window = (hcnt >= DX) && (hwin_rel <= DW)
&& (vcnt >= DY) && (vwin_rel <= DH)
fbp_bytes = dispfb_fbp << 11 // FBP × 2048
pixels_per_row = dispfb_fbw << 6 // FBW × 64
vram_x_unshift = hwin_rel / hmag_factor // 4 displayed pixels = 1 VRAM column at MAGH=3
vram_y_unshift = vwin_rel / vmag_factor
effective_x = vram_x_unshift + DBX
effective_y = vram_y_unshift + DBY
pixel_index = effective_y × pixels_per_row + effective_x
bpp_shift = (PSM == PSMCT32) ? 2 :
(PSM == PSMCT16) ? 1 :
(PSM == PSMT8) ? 0 : 2
fb_addr = fbp_bytes + (pixel_index << bpp_shift)
r/g/b drive = (de && scanout_enable && in_window) ? decode(VRAM, PSM) : 0
```
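
The pseudocode above translates almost line for line into a software reference model. The sketch below assumes the linear (non-swizzled) path, folds in the PSMT4 `pixel_index >> 1` override described in the per-PSM decode notes, and returns `None` where pcrtc would drive black; the function name and keyword arguments are hypothetical.

```python
PSMCT32, PSMCT16, PSMT8, PSMT4 = 0x00, 0x02, 0x13, 0x14


def pcrtc_fetch_addr(hcnt, vcnt, *, fbp=0, fbw=1, psm=PSMCT32,
                     dbx=0, dby=0, dx=0, dy=0, dw=0, dh=0, magh=0, magv=0):
    """VRAM byte address fetched at (hcnt, vcnt), or None outside the window."""
    hwin_rel = hcnt - dx
    vwin_rel = vcnt - dy
    in_window = hcnt >= dx and hwin_rel <= dw and vcnt >= dy and vwin_rel <= dh
    if not in_window:
        return None
    effective_x = hwin_rel // (magh + 1) + dbx
    effective_y = vwin_rel // (magv + 1) + dby
    pixel_index = effective_y * (fbw << 6) + effective_x
    fbp_bytes = fbp << 11
    if psm == PSMT4:                       # two pixels per byte
        return fbp_bytes + (pixel_index >> 1)
    bpp_shift = {PSMCT32: 2, PSMCT16: 1, PSMT8: 0}.get(psm, 2)
    return fbp_bytes + (pixel_index << bpp_shift)


# Display window DX=2/DY=1/DW=3/DH=3: displayed (5, 4) shows VRAM (3, 3).
print(pcrtc_fetch_addr(5, 4, dx=2, dy=1, dw=3, dh=3))    # 780
print(pcrtc_fetch_addr(6, 4, dx=2, dy=1, dw=3, dh=3))    # None

# MAGH=MAGV=1: two displayed pixels map to the same VRAM column.
mag = dict(dx=4, dy=2, dw=7, dh=7, magh=1, magv=1)
print(pcrtc_fetch_addr(4, 2, **mag), pcrtc_fetch_addr(5, 2, **mag))   # 0 0
```

Returning `None` rather than an address for out-of-window pixels keeps the gating decision and the address math in one place, which is convenient when a scoreboard compares against captured `r`/`g`/`b`.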
Per-PSM color decode at `vram_read_data`:
- **PSMCT32**: `r = data[7:0]`, `g = data[15:8]`, `b = data[23:16]`. Alpha at `[31:24]` is dropped (no DAC channel).
- **PSMCT16** (Ch94): RGB5A1 packed into the lower 16 bits as `{A[15], B[14:10], G[9:5], R[4:0]}`. 5→8 expansion uses bit-replicate `r8 = {r5, r5[4:2]}` (so 5'h1F → 8'hFF, 5'h00 → 8'h00). Alpha bit dropped at the DAC.
- **PSMT8** (Ch96/Ch97): index in `data[7:0]`. With `clut_enable=1` (Ch97), pcrtc presents `clut_read_idx = idx + (CSA << 4)` to the external `clut_stub` and decodes the returned PSMCT32 entry as `r = clut_data[7:0]`, `g = clut_data[15:8]`, `b = clut_data[23:16]`. With `clut_enable=0` (Ch96 fallback), pcrtc surfaces the index as grayscale so the 8-bit storage lane is visually verifiable without programming a CLUT.
- **PSMT4** (Ch103): 2 pixels per byte. `byte_offset = pixel_index >> 1` (overrides the standard `pixel_index << bpp_shift` math). `nibble = pixel_index[0] ? data[7:4] : data[3:0]` picks the 4-bit pixel; the zero-extended 8-bit value `{4'd0, nibble}` plus `(CSA << 4)` is presented on `clut_read_idx`. With `clut_enable=1`, pcrtc decodes the returned PSMCT32 entry the same way as PSMT8. With `clut_enable=0`, the fallback replicates the nibble across the 8-bit DAC value (`r = g = b = {nibble, nibble}`) so 4'hF → 0xFF and 4'h5 → 0x55. CSA is the natural per-palette-window selector for PSMT4 — multiple 16-entry palettes can share the 256-entry staging area, indexed by CSA.
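
The decode rules above can be summarized as a small reference model. This is a sketch rather than the RTL: the function names are hypothetical, the CLUT is modelled as a plain 256-entry list of PSMCT32 words, and the index is assumed to wrap modulo 256, which matches the CSA wrap cases the CLUT TBs describe.

```python
def expand5(v5: int) -> int:
    """5 -> 8 bit-replicate: {v5, v5[4:2]}."""
    return ((v5 << 3) | (v5 >> 2)) & 0xFF


def decode_psmct32(data: int):
    return data & 0xFF, (data >> 8) & 0xFF, (data >> 16) & 0xFF   # alpha dropped


def decode_psmct16(data: int):
    r5, g5, b5 = data & 0x1F, (data >> 5) & 0x1F, (data >> 10) & 0x1F
    return expand5(r5), expand5(g5), expand5(b5)                  # A[15] dropped


def clut_read_idx(idx: int, csa: int) -> int:
    return (idx + (csa << 4)) & 0xFF          # assumed 8-bit wrap


def psmt4_nibble(byte: int, pixel_index: int) -> int:
    return (byte >> 4) & 0xF if pixel_index & 1 else byte & 0xF


def decode_indexed(idx: int, csa: int, clut, clut_enable: bool, is_psmt4: bool):
    if clut_enable:
        return decode_psmct32(clut[clut_read_idx(idx, csa)])
    gray = ((idx << 4) | idx) if is_psmt4 else idx    # PSMT4 replicates the nibble
    return gray, gray, gray


# Illustrative palette in the spirit of CLUT[i] = ABGR(0xFF, i+0x80, i+0x40, i);
# the & 0xFF channel wrap is an assumption of this sketch.
clut = [(0xFF << 24) | (((i + 0x80) & 0xFF) << 16) | (((i + 0x40) & 0xFF) << 8) | i
        for i in range(256)]

print(hex(expand5(0x1F)))                              # 0xff
print(hex(clut_read_idx(0xF8, 1)))                     # 0x8
print(decode_indexed(0x5, 0, clut, False, True))       # (85, 85, 85)
```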
**Ch95 — gs_stub raster channel emits PSMCT16**. The S2 stage
of the pipeline now packs ABGR → RGB5A1 (`r5=R[7:3]`, `g5=G[7:3]`,
`b5=B[7:3]`, `a1=A[7]`) when `ras_bpp_shift==1` (PSMCT16 / PSMCT16S
/ PSMZ16 / PSMZ16S — any 16-bit PSM). The packed 16-bit pixel
goes in the LOW halfword of `raster_pixel_color_q[31:0]`, and a
new `raster_pixel_be_q[3:0]` selects which bytes vram_stub
commits: `4'b0011` for PSMCT16, `4'b1111` for PSMCT32. vram_stub
gates each byte write on `write_be[i]`, so back-to-back PSMCT16
emits write 2 bytes each without halfword stomping. New
`raster_pixel_psm_q[5:0]` exposes the current PSM for trace.
The Ch95 TB `tb_gs_raster_psmct16.sv` exercises the round trip:
gs_stub renders a 4×4 SPRITE with FRAME_1.PSM=PSMCT16, then VRAM
read-back verifies each pixel landed at the right halfword AND
that the halfword right after the sprite stays zero (no leak).
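
To make the Ch95 packing and byte-enable behavior concrete, here is a small Python model. It is a sketch under two assumptions: `vram_write` is a hypothetical stand-in for the `vram_stub` write port, and byte lanes are taken relative to the byte address presented (lane `i` targets `addr + i`).

```python
def pack_rgb5a1(r: int, g: int, b: int, a: int) -> int:
    """ABGR 8-bit channels -> RGB5A1: r5=R[7:3], g5=G[7:3], b5=B[7:3], a1=A[7]."""
    return ((a >> 7) << 15) | ((b >> 3) << 10) | ((g >> 3) << 5) | (r >> 3)


def vram_write(mem: bytearray, addr: int, data: int, be: int) -> None:
    """Commit only the byte lanes whose enable bit is set."""
    for lane in range(4):
        if (be >> lane) & 1:
            mem[addr + lane] = (data >> (8 * lane)) & 0xFF


BE_PSMCT16 = 0b0011
BE_PSMCT32 = 0b1111

mem = bytearray(8)
first = pack_rgb5a1(0x55, 0xAA, 0xBB, 0xCC)
second = pack_rgb5a1(0xFF, 0x00, 0x00, 0x00)

# Two back-to-back PSMCT16 emits at adjacent halfwords.
vram_write(mem, 0, first, BE_PSMCT16)
vram_write(mem, 2, second, BE_PSMCT16)

print(hex(first), hex(second))     # 0xdeaa 0x1f
print(mem.hex())                   # aade1f0000000000
```

With `be = 4'b1111` instead, the second write would also overwrite bytes 4 and 5; gating on `be` is what keeps the neighbouring halfword intact.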
Ch105 extends the raster channel to PSMT8 (FRAME_1.PSM=0x13).
When `ras_bpp_shift==0`, S2 takes the natural ABGR's R channel
(low 8 bits) as the PSMT8 index — the same lane real PS2 hardware
writes when the destination FB is PSMT8 — places it in the LOW
byte of the emit lane, and sets `raster_pixel_be_q = 4'b0001` so
vram_stub commits exactly the 1 byte at fb_addr. The 1-byte
commit works at any byte alignment because vram_stub gates each
byte lane independently. The Ch105 TB `tb_gs_raster_psmt8.sv`
renders a 5×3 SPRITE (chosen so the row spans byte lanes 1, 2, 3,
0, 1 — exercising every lane alignment) at FRAME_1.PSM=PSMT8 with
RGBAQ R=0x55, G=0xAA, B=0xBB, A=0xCC; asserts each sprite byte
reads back as 0x55, the bytes immediately left and right of the
sprite stay 0x00 (so 32-bit-aligned overwrite would be visible),
and a full-VRAM sweep finds NO byte equal to 0xAA / 0xBB / 0xCC
(channel-isolation: only R reaches VRAM at PSMT8).
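
The single-byte commit and the channel-isolation check can be replayed in software. The sketch below makes several assumptions that are not stated in this contract: the sprite's left edge is placed at x=1 (chosen only because it reproduces the lane pattern 1, 2, 3, 0, 1), FBP=0 and FBW=1 are used, and the upper data lanes are deliberately left carrying G/B/A to show that `be` alone keeps them out of VRAM (the RTL may well zero them).

```python
def vram_write(mem: bytearray, addr: int, data: int, be: int) -> None:
    """Hypothetical vram_stub model: lane i targets addr + i when be[i] is set."""
    for lane in range(4):
        if (be >> lane) & 1:
            mem[addr + lane] = (data >> (8 * lane)) & 0xFF


def psmt8_emit(fbp: int, fbw: int, x: int, y: int, abgr: int):
    """Return (fb_addr, data, be) for one PSMT8 pixel (bpp_shift == 0)."""
    pixel_index = y * (fbw << 6) + x
    fb_addr = (fbp << 11) + pixel_index
    return fb_addr, abgr, 0b0001           # low byte of ABGR is the R channel


ABGR = 0xCCBBAA55                          # A=0xCC, B=0xBB, G=0xAA, R=0x55
X0, Y0, W, H = 1, 0, 5, 3                  # assumed sprite placement

mem = bytearray(256)
lanes = []
for y in range(Y0, Y0 + H):
    for x in range(X0, X0 + W):
        addr, data, be = psmt8_emit(0, 1, x, y, ABGR)
        if y == Y0:
            lanes.append(addr % 4)
        vram_write(mem, addr, data, be)

print(lanes)                               # [1, 2, 3, 0, 1]
print(mem.count(0x55))                     # 15
print(any(v in (0xAA, 0xBB, 0xCC) for v in mem))   # False
```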
Ch106 closes the indexed-write gap with PSMT4 (FRAME_1.PSM=0x14)
as a per-bit RMW into `vram_stub`. Three changes form the
mechanism:
1. `vram_stub` gains a new `write_mask[31:0]` input (Ch106). The
commit is now `mem[i] <= (mem[i] & ~mask_i) | (data_i & mask_i)`
for each enabled byte. PSMCT32/16/PSMT8 tie mask=`0xFFFF_FFFF`
(no behavior change — full byte writes).
2. `gs_stub`'s S2 PSM-aware emit packing gets a PSMT4 branch:
the byte address is `pixel_index >> 1` (overrides the
`pixel_index << ras_bpp_shift` form), the index is the low
4 bits of the natural ABGR's R channel, and the emit places
that 4-bit value in either the low (`{4'd0, idx}`) or high
(`{idx, 4'd0}`) nibble of `write_data[7:0]` based on
`pixel_index[0]`. `s2_emit_be = 4'b0001`,
`s2_emit_mask = pixel_index[0] ? 0x0000_00F0 : 0x0000_000F`.
3. New `raster_pixel_mask_q[31:0]` output on `gs_stub` carries
the mask through to `vram_stub.write_mask`.
The Ch106 TB `tb_gs_raster_psmt4.sv` is intentionally
adversarial about preservation. VRAM is preloaded with `0xA5`
(high=A, low=5) at every byte the sprites will touch. Three
phases:
- **Phase A**: 4×2 SPRITE at (0,0)..(3,1), R=0x05 → idx=5. Both
nibbles of each enclosing byte are written (8 emits across 4
bytes); each byte ends at `0x55` and the four neighbouring
preloaded bytes (2..3, 34..35) remain `0xA5`. This proves the
back-to-back same-byte case (NBA chaining) and the neighbour-
byte preservation in one go.
- **Phase B**: single-pixel SPRITE at (5, 2). x=5 odd → high
nibble; pixel_index = 133, byte_addr = 66; idx=7. Preload
`mem[66] = 0xA5`. Expected after raster: `mem[66] = 0x75`:
high nibble updated from A to 7, low nibble stays 5. Proves
isolated high-nibble RMW preserves the low nibble.
- **Phase C**: single-pixel SPRITE at (4, 3). x=4 even → low
nibble; pixel_index = 196, byte_addr = 98; idx=9. Preload
`mem[98] = 0xA5`. Expected after: `mem[98] = 0xA9` — low
nibble updated from 5 to 9, high nibble stays A. Proves
isolated low-nibble RMW preserves the high nibble.
Continuous observer asserts `psm_q == 6'h14`, `be_q == 4'b0001`,
and `mask_q ∈ {0x0F, 0xF0}` on every emit. Final aggregate
checks: 10 emits total, full-VRAM sweep finds NO byte equal to
0xAA / 0xBB / 0xCC (only R reaches the framebuffer at PSMT4).
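
The three phases can be replayed against a software model of the masked commit. This is a sketch, not the TB: the function names are hypothetical, FBP=0 and FBW=1 are inferred from the quoted `pixel_index` values, and the whole model memory is preloaded with `0xA5` rather than only the bytes the sprites touch.

```python
def vram_write_masked(mem: bytearray, addr: int, data: int, be: int, mask: int) -> None:
    """mem[i] <= (mem[i] & ~mask_i) | (data_i & mask_i) for each enabled byte."""
    for lane in range(4):
        if (be >> lane) & 1:
            d = (data >> (8 * lane)) & 0xFF
            m = (mask >> (8 * lane)) & 0xFF
            mem[addr + lane] = (mem[addr + lane] & ~m & 0xFF) | (d & m)


def psmt4_emit(fbw: int, x: int, y: int, r: int, fbp: int = 0):
    """Return (byte_addr, data, be, mask) for one PSMT4 pixel."""
    pixel_index = y * (fbw << 6) + x
    byte_addr = (fbp << 11) + (pixel_index >> 1)
    idx = r & 0xF                                  # low 4 bits of R
    high = pixel_index & 1
    data = (idx << 4) if high else idx
    mask = 0x0000_00F0 if high else 0x0000_000F
    return byte_addr, data, 0b0001, mask


mem = bytearray([0xA5] * 256)
emits = 0


def draw(x: int, y: int, r: int) -> None:
    global emits
    vram_write_masked(mem, *psmt4_emit(1, x, y, r))
    emits += 1


for y in range(2):                 # Phase A: 4x2 sprite, idx=5
    for x in range(4):
        draw(x, y, 0x05)
draw(5, 2, 0x07)                   # Phase B: high nibble of byte 66
draw(4, 3, 0x09)                   # Phase C: low nibble of byte 98

print([hex(mem[a]) for a in (0, 1, 32, 33)])   # ['0x55', '0x55', '0x55', '0x55']
print(hex(mem[66]), hex(mem[98]), emits)       # 0x75 0xa9 10
```

Because each commit reads the current byte before merging, the second emit into a byte sees the first emit's nibble, which is the software analogue of the back-to-back same-byte case Phase A checks.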
DBX / DBY shift the VRAM origin: the pixel that appears at
displayed (DX, DY) corresponds to VRAM (DBX, DBY). Real PS2
drivers use this for double-buffered framebuffers (alternate
frames at different DBX/DBY) and offset display windows.
The following TBs lock these contracts:
- `tb_gs_scanout_basic.sv` — DBX=DBY=0, DISPLAY1 covers full
active area, MAGH=MAGV=0 (1×): classic sprite-at-origin scanout.
- `tb_gs_scanout_dbx_dby.sv` — sprite at VRAM (4,2)..(7,5),
DISPFB1.DBX=4/DBY=2, DISPLAY1 full active area, MAGH=MAGV=0:
sprite shows at displayed (0..3, 0..3).
- `tb_gs_scanout_display_window.sv` — sprite at VRAM (0..3, 0..3),
DBX=DBY=0, DISPLAY1 with DX=2/DY=1/DW=3/DH=3, MAGH=MAGV=0:
sprite shows at displayed (2..5, 1..4); pixels outside the
window are black even though pcrtc's raster passes through them.
- `tb_gs_scanout_magh_magv.sv` (Ch93) — sprite at VRAM (0..3, 0..3),
DBX=DBY=0, DISPLAY1 with DX=4/DY=2/DW=7/DH=7, MAGH=1/MAGV=1
(2×/2×): 4×4 VRAM sprite stretches to fill the 8×8 displayed
window pixel-perfect; pixels outside the window are black.
- `tb_gs_scanout_psm16.sv` (Ch94) — 4×4 RGB5A1 sprite written
directly to vram_stub at PSMCT16 byte stride, DISPFB1.PSM=0x02:
5→8 bit-replicate decode produces the right (R8, G8, B8) at
scanout. (No gs_stub instantiated; this TB exercises the PSM
decode path in isolation.)
- `tb_gs_scanout_psmt8.sv` (Ch96) — 4×4 PSMT8 sprite of indices
0x10..0x1F written directly to vram_stub at 1 byte/pixel
stride. DISPFB1.PSM=0x13, DISPLAY1 with DX=4/DY=2/DW=7/DH=3
AND MAGH=1 (2× horizontal). Asserts each scan-out displayed
pixel reads back as grayscale R=G=B=expected index, proving
byte stride + display window + horizontal magnification all
work at 1 byte/pixel.
- `tb_gs_scanout_psmt8_clut.sv` (Ch97) — same 4×4 PSMT8 sprite,
plus a programmed CLUT where `CLUT[i] = ABGR(0xFF, i+0x80, i+0x40, i)`.
With `clut_enable=1` and `clut_csa=0`, asserts each scan-out
pixel reads back as the CLUT entry for its index — PSMT8
storage + CLUT lookup compose correctly into real RGB. Three
phases: full-frame CSA=0, single-pixel CSA=1 (idx 0x00 →
CLUT[0x10]), and CSA=1 wrap (idx 0xF8 → CLUT[0x08]).
- `tb_gs_tex0_clut.sv` (Ch98) — drives gs_stub's GIF reg# 0x06
(TEX0_1) and asserts the latch + sub-field decoders match the
encoded payload (CBP/CPSM/CSM/CSA/CLD bit ranges). Phase 2
wires `pcrtc.clut_csa` from `gs_stub.tex0_1_csa_q` (instead
of TB-side sideband) and verifies the CSA value flows from a
GIF register write into the CLUT lookup math at scan-out.
- `tb_gs_clut_load.sv` (Ch99) — full TEX0.CLD-driven VRAM→CLUT
load round trip. Stages 256 PSMCT32 entries in VRAM at
`CBP*256` (using the new `vram_stub` second read port), drives
TEX0_1 with `CBP=4, CPSM=PSMCT32, CSM=CSM2, CLD=1`, waits for
`clut_loader_stub.load_busy` to fall, then runs PSMT8 scanout
and asserts each in-sprite pixel reads back as the CLUT entry
the loader copied — no TB-direct CLUT writes needed. Also
carries a Ch99-audit negative phase: a TEX0 write with CSM=0
(CSM1 swizzle, deferred) silently no-ops instead of laying
down wrong linear bytes.
- `tb_gs_clut_load_ct16.sv` (Ch100) — CPSM=PSMCT16 variant of the
Ch99 load TB. Stages 256 RGB5A1 entries (2 bytes each) in VRAM
at `CBP*256`, drives TEX0_1 with `CPSM=2`. The loader now
walks at 2-byte stride, unpacks RGB5A1 → PSMCT32 ABGR via 5→8
bit-replicate, and writes to clut_stub. PSMT8 scanout produces
the expanded RGB. Ch100-audit alpha coverage: per-entry `a1 = idx[0]`
varies the alpha bit so both `{8{0}} = 0x00` and `{8{1}} = 0xFF`
are exercised; a TB-side `clut_we` snoop captures every loader
write so alpha can be asserted directly without going through
the RGB-only scanout path.
- `tb_gs_clut_load_cld_modes.sv` (Ch101 + Ch102) — conditional
CLD-mode policy. Phases walk through CLD ∈ {0, 1, 2, 3, 4, 5,
6, 7} with varying CBP/CPSM/CSA, counting `loader_busy` rising
edges to prove: CLD=0 never loads; CLD=1 always (full); CLD=2
loads only when CBP changed; CLD=3 loads when CBP/CPSM/CSA
any-changed (CBP, CSA, and CPSM arms each isolated); CLD=4
always loads but only the 16-entry CSA window (Ch102 — write
range correctness is locked by `tb_gs_clut_load_csa_window`);
CLD ∈ {5, 6, 7} reserved no-ops.
- `tb_gs_clut_load_csa_window.sv` (Ch102) — CLD=4 write-range
correctness. Phase 1 stages 256 distinct PSMCT32 entries in
VRAM and runs CLD=1 to fill all 256 CLUT slots with pattern_a.
Phase 2 stages 16 different entries at a new CBP, drives CLD=4
with CSA=2 (window = idx 32..47), and asserts via a `clut_we`
snoop that exactly 16 writes occurred AND the captured array
contains: pattern_a(i) at i ∈ [0, 32) ∪ [48, 256), pattern_b(i-32)
at i ∈ [32, 48). Proves 240 entries are preserved across the
partial load. Audit-low extensions: Phase 3 covers the
high-CSA wrap (CSA=16 → window-base wraps mod-256 to 0); Phase
4 covers CT16 partial (CPSM=PSMCT16, 2-byte stride, RGB5A1
unpack at the loader, window at idx 160..175).
- `tb_gs_scanout_psmt4_clut.sv` (Ch103) — PSMT4 scanout. Stages
a 4×4 PSMT4 sprite (2 pixels/byte) and 16 CLUT entries.
Phase 1 (`clut_enable=1`): asserts each pixel reads
`CLUT[zero-ext(nibble) + CSA*16]`. Phase 2 (`clut_enable=0`):
asserts the grayscale fallback replicates the 4-bit nibble
across the 8-bit DAC value. Both phases verify byte-stride
half-extraction (low/high nibble pick) at every active pixel.
Audit-low Phase 3 locks PSMT4 + nonzero CSA (CSA=1, window
16..31) end-to-end: TB-direct CLUT writes plant a 0xDEAD_BEEF
sentinel at entries 0..15 and a per-index formula at 16..31,
scanout asserts each pixel reads the formula and never the
sentinel.
- `tb_gs_demo_psmt4_e2e.sv` (Ch107) — first end-to-end demo for
the GS/PCRTC stack. **Scope is GS-side only**: the post-GIF
register stream (per-reg A+D writes via `gs_stub.gif_reg_*`)
plus privileged-block MMIO drive the pipeline; `gif_packed_stub`
/ GIFtag-PACKED is BYPASSED — feeding the same demo through
the GIF front-end is a future chapter. Step 1 stages 16
PSMCT32 palette entries in VRAM at `CBP*256` (modelled as a
TB-direct write — DMA→GS image transfer is a future chapter,
but the framebuffer itself is NOT TB-direct). Step 2 drives
per-reg writes (PRIM/FRAME_1/RGBAQ/XYZ2) for four SPRITEs
laying out a 4-quadrant 8×4 image (TL idx 0x5, TR idx 0x7,
BL idx 0xA, BR idx 0xC) at FRAME_1.PSM=PSMT4 — all 32
framebuffer pixels arrive via the Ch106 raster channel.
Step 3 drives TEX0_1 with `CBP=palette, CPSM=PSMCT32,
CSM=CSM2, CSA=0, CLD=4`; loader writes clut_stub[0..15].
Step 4 brings up scanout via privileged-block writes to
DISPFB1 (PSM=PSMT4) + DISPLAY1 + PMODE.EN1. Step 5 captures
one full frame and asserts each pixel reads back as
`CLUT[quadrant_idx]` (or `CLUT[0]` outside the 8×4 image
since vram_stub zero-init means nibble=0). Aggregate asserts:
32 PSMT4 emits, mask ∈ {0x0F, 0xF0} on every emit
(channel-isolation locked architecturally — only R[3:0] ever
reaches VRAM at PSMT4), loader fires exactly once, no
raster_overflow / raster_degenerate. This TB is the first
stack-wide proof that the GS-side post-GIF sequence —
per-reg writes → indexed framebuffer → TEX0+CLD palette
upload → PMODE/DISPFB/DISPLAY scanout — produces a coherent
RGB frame end to end without TB sideband for the framebuffer
pixels. Routing the same primitives through GIFtag/PACKED A+D
via `gif_packed_stub` closes the last sideband and is the
natural Ch108 anchor.
- `tb_gs_demo_psmt4_e2e_ee_full_bootlet.sv` (Ch114) — extends
Ch113's EE-driven control plane to ALSO drive the DMAC
channel-2 setup from the same MIPS instruction stream. The EE
program now writes the 4 GS-priv registers + the 3 DMAC ch2
registers (MADR / QWC / CHCR.start) via real `sw`
instructions, then SYSCALLs to halt. Total: 7 EE-CPU MMIO
writes (4 GS-priv + 3 DMAC) producing the same 16×8 captured
frame. **Architectural note**: the program lives in
`bios_rom_stub` at 0xBFC0_0000 / phys 0x1FC0_0000, NOT in
RAM. A RAM-resident program would have its instruction
fetches contend with the DMAC's RAM reads through
`ee_ram_stub`'s single read port (the map's CPU>DMAC
arbitration silently corrupts DMAC data). Putting the program
in BIOS decouples the two paths so EE and DMAC run truly in
parallel. This also matches real PS2: the EE boots out of
BIOS ROM. PASS criteria add to Ch113's: **3 EE-driven DMAC
writes** seen at the map's DMAC-ch2 decode; the existing
`dma=(1,36,1)` event taxonomy still holds (those events are
triggered by the EE's CHCR write, not a TB-direct write).
The remaining TB-direct surfaces in the demo are now narrowly
the GIF payload pre-stage in RAM (a real EE driver would
itself stage this) and bios_rom_stub's program preload (which
is the EE bootlet itself — not a runtime TB sideband).
- `tb_gs_demo_psmt4_e2e_ee_program.sv` (Ch113) — same demo as
Ch112 but the 4 control-plane MMIO writes (PMODE / DISPFB1 /
DISPLAY1 lo / DISPLAY1 hi) are no longer issued by the TB
directly. Instead a 10-instruction MIPS program preloaded into
ee_ram_stub at phys 0x800 (kseg0 0x80000800) is fetched and
executed by `ee_core_stub` (parameterized with
`PC_RESET=0x80000800`). The program is `LUI/ORI/SW × 4` plus a
SYSCALL terminator; the SW instructions target `0x12000000+`
and flow through `ee_memory_map_stub`'s GS-priv decode →
`ee_gs_priv_bridge_stub` → `gs_stub.reg_wr_*`. Closes the
last TB-direct GS control-plane surface in the demo flow: every byte AND
every register bit AND every GS control-plane decision now
arrives from a real-shape source. PASS criteria add to
Ch112's: `core_halt_o == 1` (asserts exactly once on the
SYSCALL halt), `core_trap == 0`, EE program halts at
`EE_PROG_VA + 36 = 0x80000824` (the SYSCALL slot). The TB
still pre-stages the GIF payload and triggers the DMAC
channel-2 transfer via TB-direct CHCR/MADR/QWC writes — a
wider EE program that also drives DMAC bring-up is a
separate future chapter.
- `tb_gs_demo_psmt4_e2e_eemap.sv` (Ch112) — same demo as Ch111
but the bridge is no longer driven by the TB directly. Instead
the TB drives `ee_memory_map_stub.ee_wr_*` with full 32-bit
physical addresses targeting the new GS-privileged-MMIO window
at 0x1200_0000-0x1200_FFFF (64 KiB; phys[28:16] == 13'h1200).
The map decodes the window, peels the 16-bit offset, and hands
the 32-bit half-write to `ee_gs_priv_bridge_stub`, which then
fires gs_stub.reg_wr_* with the running 64-bit shadow value.
Closes the last control-plane routing gap before a real EE
instruction stream can drive the demo's bring-up: PMODE /
DISPFB1 / DISPLAY1 are now reachable from `sw 0x1200_0080(...)`-
shaped writes rather than from a TB-shaped EE-MMIO port.
PASS criteria identical to Ch111: 4 EE-MMIO writes / 4 bridge
fires, same 16×8 captured frame. **Architectural note**: this
chapter ALSO adds 4 new output ports to `ee_memory_map_stub`
(`ee_gs_priv_wr_en/addr/data/be`). The 56 existing
`ee_memory_map_stub`-using TBs leave those outputs unconnected (named-port
instantiation tolerates omitted outputs); only the new Ch112
TB wires them through to the bridge.
- `tb_gs_demo_psmt4_e2e_eemmio.sv` (Ch111) — same demo as
Ch110 but the privileged-block control writes (PMODE / DISPFB1
/ DISPLAY1) now arrive through `ee_gs_priv_bridge_stub` (a new
RTL module) driven by EE-shaped 32-bit MMIO writes from the
TB, instead of TB-direct gs_stub.reg_wr_* pulses. The bridge
accumulates 32-bit half-writes per 8-byte slot and fires a
64-bit gs_stub.reg_wr_* pulse on each EE half-write —
single-half writes work for PMODE.EN1 and DISPFB1 (interesting
bits in the low 32), and a pair of writes (lo+hi to
consecutive 4-byte addresses) handles DISPLAY1 whose DW/DH
live in the high 32. **Bridge contract**: full-word writes
only — `ee_wr_be` must be `4'b1111`; sub-word (per-byte)
merging into the 64-bit shadow is intentionally out of scope
and a `$error` fires on any narrower be (control-plane GS
registers are always written as full 32-bit `sw` halves of an
`sd`). **Scope precision**: this chapter closes the TB-direct
`gs_stub.reg_wr_*` surface — i.e., the privileged-MMIO sink at
the GS itself. The bridge is instantiated by the TB directly;
it is NOT yet wired into `ee_memory_map_stub`, so the full
EE-CPU / memory-map MMIO path (a real EE instruction stream
reaching 0x12000000+ via `sw`) is a separate future chapter.
PASS criteria add to Ch110's: **4 EE-MMIO writes** (1 PMODE +
1 DISPFB1 + 2 DISPLAY1) and **4 bridge fires** producing the
same 16×8 captured frame as Ch110.
- `tb_gs_demo_psmct32_swizzle_trxdir_e2e.sv` (Ch124) — companion
to Ch123: same EE-bootlet → DMAC → GIF data plane and same all-
three-gates-on instantiation, but the framebuffer is filled by
a TRXDIR/IMAGE upload through `gif_image_xfer_stub` instead of
by raster. The Ch121 image-xfer write-side swizzle gate becomes
LOAD-BEARING inside the demo flow — every byte the GS produces
comes out of the image-xfer engine at canonical PSMCT32
swizzled addresses, and the raster path is dormant. Payload:
U1 (PACKED, NREG=4: BITBLTBUF{DBP=0, DBW=1, DPSM=PSMCT32} /
TRXPOS{DSAX=DSAY=0} / TRXREG{RRW=16, RRH=8} / TRXDIR{XDIR=0})
+ U2 (IMAGE, NLOOP=32: 32 IMAGE qwords carrying the 128 PSMCT32
pixels of the same four-quadrant pattern Ch123 used). DMAC QWC
= 38. Verification mirrors Ch123: (1) full 16×8 scanout frame
capture; (2) per-pixel byte readback at the canonical swizzled
address via vram_stub's 2nd read port; (3) strict linear-vs-
swizzled separator at byte 1024 stays 0. Aggregate counts:
`dma=(1,38,1) ee_dmac_wr=3 giftags=2 ad_writes=4
xfer_writes=128 ee_priv_wr=4 bridge_fires=4 core_halt=1
emits=0 frame=16x8`. Ch123 + Ch124 together exercise BOTH
PSMCT32 write-side paths (raster Ch122 + image-xfer Ch121)
end-to-end through the same driver-shaped flow with the
same swizzled-scanout (Ch120) read side.
- `tb_gs_demo_psmct32_swizzle_e2e.sv` (Ch123) — full driver-shaped
end-to-end demo with ALL THREE PSMCT32 swizzle gates flipped
on simultaneously: `gs_stub#(PSMCT32_SWIZZLE=1)` (Ch122 raster),
`gif_image_xfer_stub#(PSMCT32_SWIZZLE=1)` (Ch121 — instantiated
but unused in this demo), `gs_pcrtc_stub#(PSMCT32_SWIZZLE=1)`
(Ch120 read). The data plane is the same DMAC + GIF + EE-bootlet
shape Ch107..Ch114 demos use: a BIOS-resident EE program
(PC_RESET=0xBFC0_0000) configures GS-priv (DISPFB1, DISPLAY1
lo/hi, PMODE.EN1) via `sw` instructions through
`ee_memory_map_stub` → `ee_gs_priv_bridge_stub` →
`gs_stub.reg_wr_*`, then kicks DMAC ch2 (MADR / QWC / CHCR)
via `sw` to the DMAC reg window, then `SYSCALL` halts. DMAC
delivers a 24-qword payload from `ee_ram_stub` to
`gif_packed_stub`, which dispatches 4 SPRITE PACKED packets
(1 GIFtag + 5 A+D each — PRIM, FRAME_1=PSMCT32, RGBAQ, XYZ2,
XYZ2). The 4 sprites tile the 16×8 active area into 4 quadrants
with unique RGB triples. With the raster gate on, all 128
per-pixel store addresses go through `gs_swizzle_psmct32_stub`;
with the pcrtc gate on, scanout reads from those same swizzled
addresses. **Two-phase verification**: (1) **scanout** — every
(x, y) in 16×8 captures its sprite's RGB; (2) **byte readback
via vram_stub's 2nd read port** — for every (x, y), the 32-bit
word at `ref_addr_psmct32(0, 1, x, y)` equals the sprite's
`{A=0xFF, B, G, R}` PSMCT32 word. Strict linear-vs-swizzled
separator at byte 1024 (where the linear formula's y=4 row
would land at stride=256) stays 0 — the swizzled write set
for the 16×8 image stays in blocks (0,0) and (1,0) of page 0
(bytes 0..511), so a fall-through to linear would have placed
sprite-2's color at byte 1024. Aggregate counts:
`dma=(1,24,1) ee_dmac_wr=3 giftags=4 ad_writes=20 xfer_writes=0
ee_priv_wr=4 bridge_fires=4 core_halt=1 emits=128 frame=16x8`.
This is the FIRST end-to-end demo where every PSMCT32 byte
the GS produces lives at the canonical PCSX2 swizzled address
AND the scanout reads from it — byte-accurate to real PS2
VRAM layout, end-to-end through the driver-shaped flow.
- `tb_gs_raster_swizzle_psmct32.sv` (Ch122) — focused contract
for the new `PSMCT32_SWIZZLE` parameter on `gs_stub`. When the
parameter is set to 1 AND the active raster PSM is PSMCT32
(`ras_psm == 6'h00`), the per-pixel raster emit address is
routed through the Ch119 `gs_swizzle_psmct32_stub` (FBP=ras_fbp,
FBW=ras_fbw, x=s2_x_q[11:0], y=s2_y_q[11:0]) and its output is
the absolute byte address (FBP*2048 already baked in).
At Ch122 only, PSMCT16/PSMT8/PSMT4 raster emits always took
the linear path. Ch128 later closed the PSMCT16 raster gate,
Ch134 the PSMT8 raster gate, and Ch140 the PSMT4 raster gate
(each with its own per-PSM parameter on this same `gs_stub`).
Default 0 keeps every existing PSMCT32
raster TB unchanged.
**Three-phase verification**: (1) **origin SPRITE** — drive a
single 16×4 SPRITE at FRAME_1{FBP=0, FBW=1, PSMCT32} with RGBAQ
R=0x55/G=0xAA/B=0xCC/A=0x77, expect 64 emits, per-pixel byte
readback via vram_stub's 2nd read port at swizzled addresses
confirms each pixel lands where the swizzle says. Strict
linear-vs-swizzled separators at bytes 512 and 768 (the linear
formula's y=2 / y=3 row starts) stay 0 — proves the gate is
live. (2) **scanout agreement** — enable the Ch120 swizzled-
pcrtc path on the same VRAM contents, capture the full 16×4
frame, assert each visible pixel reads back the SPRITE's RGB.
Both gs_stub (Ch122 raster) and gs_pcrtc_stub (Ch120 scanout)
instantiate the same swizzle module; a successful capture
proves the two integrations agree at byte level — what raster
wrote at swizzled addresses comes out on r/g/b at the same
(x, y). (3) **non-origin SPRITE** — re-arm the raster with
FRAME_1{FBP=4, FBW=2, PSMCT32} and an 8×2 SPRITE at
(60, 4)..(67, 5) crossing the page-x boundary at x=64 (so
page_index actually changes mid-row). Pins three contracts
the origin transfer can't distinguish from a buggy
implementation: (a) `ras_fbp` reaches the swizzle's `fbp` input
(FBP=0 in Phase 1 would have masked a tied-zero regression),
(b) `ras_fbw` reaches the swizzle's `fbw` input (FBW=1 would
have masked a tied-one regression), (c) the swizzle gets the
FULL absolute pixel coords (s2_x_q, s2_y_q) rather than
bbox-local coords (Phase 1's sprite started at (0,0) so
absolute and local were equal there). Strict linear-vs-
swizzled separator at byte 10480 (where the linear formula
would land Phase-3's first pixel) stays 0. Total emit count
after all phases: 64 + 16 = 80. With Ch120 (read), Ch121
(TRXDIR upload), and Ch122 (raster emit) all live, the three
major PSMCT32 paths are byte-consistent end-to-end.
- `tb_gs_image_xfer_swizzle_psmct32.sv` (Ch121) — focused contract
for the new `PSMCT32_SWIZZLE` parameter on `gif_image_xfer_stub`.
When the parameter is set to 1 AND the upload's PSM is PSMCT32,
per-pixel VRAM byte addresses are routed through the Ch119
`gs_swizzle_psmct32_stub` (FBP=0, FBW=DBW, x=DSAX+cur_x, y=DSAY+
cur_y) and `dest_base_q (= DBP*256)` is added back to anchor at
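the upload's DBP base. At Ch121 only, PSMCT16/PSMT8/PSMT4
uploads always took the linear path; Ch127, Ch133, and Ch139
later added their own per-PSM gates on this same module.
Default 0 keeps every existing image-xfer TB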
unchanged. **Three-phase verification**: (1) **origin transfer**
— TRXDIR upload of a 16×4 PSMCT32 image at DBP=DSAX=DSAY=0,
DBW=1, RRW=16, RRH=4 → 64 pixels, 16 IMAGE qwords. After the
upload completes, the TB reads VRAM via vram_stub's 2nd read
port at the SWIZZLED byte address (TB-side `ref_addr()` mirrors
the swizzle module) and asserts each pixel landed where the
swizzle says. Strict linear-vs-swizzled separator: bytes 512
and 768 (where linear y=2 and y=3 rows would land) stay 0 under
swizzled, since the 16×4 image only fills blocks (0,0) and (1,0)
which together cover bytes [0..127] ∪ [256..383]. (2) **scanout
agreement** — enable the Ch120 swizzled-pcrtc path on the same
VRAM contents, capture the full 16×4 frame, assert each
scanned-out pixel matches its uploaded color. Both upload and
scanout instantiate the same `gs_swizzle_psmct32_stub`, so a
successful capture proves the two integrations agree at byte
level — what was written by TRXDIR comes out on r/g/b at the
same (x, y). (3) **non-origin transfer** — re-arm with NONZERO
DBP, DSAX, and DSAY (DBP=8, DSAX=4, DSAY=2, RRW=8, RRH=4) and
verify each uploaded pixel lands at `DBP*256 + swizzle(0, DBW,
DSAX+x_local, DSAY+y_local)`. Phase 3 pins TWO contracts the
origin transfer can't distinguish from a buggy implementation:
(a) `dest_base_q (= DBP*256)` is correctly ADDED ON TOP of the
swizzle output (with DBP=0 a missing-add regression would still
pass), and (b) the swizzle is fed the FULL effective coordinates
(with DSAX=DSAY=0 a "feeds only cur_x/cur_y" regression would
still pass). Strict linear-vs-swizzled separator at byte 3088
(where the linear formula's y=2 row of the P3 image would
land) stays 0 under swizzled. NOTE: at Ch121 ONLY, gs_stub
raster writes still used linear addressing; Ch122 closed that
gate.
- `tb_gs_scanout_swizzle_psmct32.sv` (Ch120) — focused contract
for the new `PSMCT32_SWIZZLE` parameter on `gs_pcrtc_stub`. When
the parameter is set to 1 AND the active PSM is PSMCT32, PCRTC
reads VRAM at swizzled addresses (via the Ch119 swizzle module
instantiated inside pcrtc) instead of the legacy linear formula.
Other PSMs (CT16/T8/T4) and `PSMCT32_SWIZZLE=0` keep the original
linear path unchanged. Topology: TB drives `vram_stub.write_*`
directly with each pixel's color preloaded at the swizzled byte
address (TB-side `ref_addr()` mirrors the DUT swizzle math), then
pcrtc with `PSMCT32_SWIZZLE=1` scans out the frame and the TB
asserts each captured pixel matches the preloaded color. Image
is 16×4 PSMCT32 (covers blocks (0,0) AND (1,0) horizontally) at
FBP=0/FBW=1; pcrtc active area is 8×4 (block (0,0) entirely),
but the swizzle vs. linear distinction shows up at any y>0
(linear y=1 → byte 64; swizzled byte 32) so even the in-window
region is a strict separator. Per-pixel color is unique
(`{A=0xFF, B=y<<4, G=x<<4, R=0x10|(y*8+x)}`) so any wrong-
address commit surfaces immediately. NOTE: at Ch120 ONLY,
gs_stub raster writes and gif_image_xfer_stub uploads still
used linear addressing — Ch120 was read-side only. Ch121
(image-xfer) and Ch122 (raster) closed the write-side gates,
and Ch123 demonstrates all three running together end-to-end.
- `tb_gs_demo_psmt8_swizzle_trxdir_e2e.sv` (Ch136) — companion to
  Ch135: same EE-bootlet → DMAC → GIF data plane and same
  all-three-gates-on instantiation, but the framebuffer is filled by
a TRXDIR/IMAGE upload through `gif_image_xfer_stub` instead of
by raster. The Ch133 PSMT8 image-xfer write-side swizzle gate
becomes LOAD-BEARING inside the demo flow — every byte the GS
produces comes out of the image-xfer engine at canonical PSMT8
swizzled addresses, and the raster path is dormant. Mirrors
Ch124 PSMCT32 + Ch130 PSMCT16 TRXDIR demos for the third PSM.
Payload: U1 (PACKED, NREG=4: BITBLTBUF{DBP=0, DBW=2,
DPSM=PSMT8} / TRXPOS{DSAX=DSAY=0} / TRXREG{RRW=16, RRH=8} /
TRXDIR{XDIR=0}) + U2 (IMAGE, NLOOP=8: 8 IMAGE qwords each
carrying 16 PSMT8 bytes for the 16×8 image, row-major). DBW=2
is the minimum even DBW for PSMT8. DMAC QWC=14. Per-quadrant
byte indices Q0=0xA0/Q1=0x40/Q2=0xC0/Q3=0x60 reused from Ch135
  so the verify side is unchanged. A new `trxdir_arms_seen` counter
  is asserted to equal 1 (single TRX setup), and an xfer-side
  per-emit observer asserts every xfer_we pulse fires with
  `be=4'b0001`, `mask=0xFFFFFFFF` (PSMT8 single-byte commit shape).
  Verification mirrors Ch135: (1) full 16×8 scanout frame capture;
  (2) per-pixel BYTE readback at the canonical swizzled byte address
(with `addr[1:0]` selecting the right byte from the 32-bit
word) via vram_stub's 2nd port; (3) strict separators at bytes
128 and 256 stay 0. Aggregate counts: `dma=(1,14,1)
ee_dmac_wr=3 giftags=2 ad_writes=4 trxdir_arms=1
xfer_writes=128 ee_priv_wr=4 bridge_fires=4 core_halt=1
  emits=0 frame=16x8`. **First-attempt PASS** errors=0. Ch135 +
  Ch136 together close the PSMT8 byte-accuracy milestone
  end-to-end through the full driver-shaped flow, the same shape
  as Ch123+Ch124 (PSMCT32) and Ch129+Ch130 (PSMCT16).
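The DMAC QWC figures quoted for the TRXDIR demos follow from simple GIF packet arithmetic, sketched below. The helper name `gif_packet_qwords` is hypothetical, and NLOOP=1 for the U1 PACKED packet is an assumption (the text gives only NREG=4, and `ad_writes=4` is consistent with it).

```python
def gif_packet_qwords(flg: str, nloop: int, nreg: int = 1) -> int:
    """Qwords in one GIF packet: 1 GIFtag plus its data qwords."""
    if flg == "PACKED":
        return 1 + nloop * nreg  # one qword per register per loop
    if flg == "IMAGE":
        return 1 + nloop         # NLOOP counts raw image qwords
    raise ValueError(f"unsupported FLG: {flg}")


# Ch136 payload: U1 = PACKED, NREG=4 (NLOOP=1 assumed); U2 = IMAGE, NLOOP=8.
ch136_qwc = gif_packet_qwords("PACKED", 1, 4) + gif_packet_qwords("IMAGE", 8)

# Image sizing: a 16x8 PSMT8 image is 16*8 pixels of 8 bits, 128 bits per qword.
ch136_image_qwords = (16 * 8 * 8) // 128
```

The same helper gives 10 for the Ch142 payload (IMAGE NLOOP=4) and 22 for the Ch130 payload (IMAGE NLOOP=16), matching the QWC values stated for those demos.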
- `tb_gs_demo_psmt4_swizzle_trxdir_e2e.sv` (Ch142) — companion to
Ch141 (raster-driven PSMT4 e2e): same EE-bootlet → DMAC → GIF
data plane and same all-three-gates-on instantiation, but the
framebuffer is filled by a TRXDIR/IMAGE upload through
`gif_image_xfer_stub` instead of by raster. The Ch139 PSMT4
image-xfer write-side swizzle gate becomes LOAD-BEARING inside
the demo flow — every nibble the GS produces comes out of the
image-xfer engine at canonical PSMT4 swizzled (addr,
nibble_hi) slots, and the raster path is dormant. Mirrors
Ch124's PSMCT32 TRXDIR demo, Ch130's PSMCT16 TRXDIR demo, and
Ch136's PSMT8 TRXDIR demo for the fourth (and last) common
GS PSM. Cloned from Ch136 and surgically retargeted to
PSMT4. Payload: U1 (PACKED, NREG=4: BITBLTBUF{DBP=0, DBW=2,
DPSM=PSMT4} / TRXPOS{DSAX=DSAY=0} / TRXREG{RRW=16, RRH=8} /
TRXDIR{XDIR=0}) + U2 (IMAGE, NLOOP=4 EOP=1: 4 IMAGE qwords
carrying 32 PSMT4 nibbles each — at RRW=16 each qword holds
2 rows: lanes 0..15 = row 2*qi, lanes 16..31 = row 2*qi+1,
matching Ch139's focused-TB packing). Total QWC = 10 (5+5).
EE-bootlet DISPFB1 immediate identical to Ch141 (LUI 0x000A;
ORI 0x0400 → PSM=PSMT4). Per-quadrant nibbles match Ch141
verbatim (Q0=0xA → 0xAA, Q1=0x4 → 0x44, Q2=0xC → 0xCC,
Q3=0x6 → 0x66) so the verify side reuses Ch141's pattern
unchanged. Verification mirrors Ch141: (1) full 16×8 scanout
frame capture via Ch138 swizzled-pcrtc; (2) per-pixel NIBBLE
readback at the canonical swizzled (addr, nibble_hi) slot
via vram_stub's 2nd port (addr[1:0]-keyed byte selection
  then nibble_hi-keyed nibble selection); (3) strict
  linear-vs-swizzled separator at byte 128 stays 0 (per-byte check,
not full word: a neighbor byte may legitimately be touched);
(4) per-emit observer asserts every image-xfer write is
`be=4'b0001` / `mask ∈ {0x0F, 0xF0}` (PSMT4 nibble RMW
shape) and the `trxdir_wr_q` arming pulse fires exactly
once. Aggregate counts: `dma=(1,10,1) ee_dmac_wr=3
giftags=2 ad_writes=4 trxdir_arms=1 xfer_writes=128
ee_priv_wr=4 bridge_fires=4 core_halt=1 emits=0 frame=16x8`.
Ch141 + Ch142 together exercise BOTH PSMT4 write-side paths
(raster Ch140 + image-xfer Ch139) end-to-end through the
same driver-shaped flow with the same swizzled-scanout
(Ch138) read side — bringing PSMT4 to full parity with the
PSMCT32, PSMCT16, and PSMT8 e2e coverage from Ch123+Ch124,
Ch129+Ch130, and Ch135+Ch136. **Architectural milestone**:
this is the first state of the project where ALL FOUR
  common GS PSMs (CT32 + CT16 + T8 + T4) have BOTH a
  raster-driven AND a TRXDIR-driven driver-shaped end-to-end
  byte-accuracy demo, closing the **four-PSM × three-path ×
  dual-driver-shape e2e foundation** (8 demos total). The bug-fix
iteration: TB-side `ref_col_idx4` was first written with a
7-bit case key `{yb[2:0], xb[3:0]}` covering yb=0..7 in
indices 0..127, but the values for yb=4..7 were miscopied
from Ch139's yb=12..15 row (Ch139 only exercises yb=0..3
and yb=12..15). Phase 2 readback failed for all 64 pixels
in y=4..7 with `got=0 expected=0xC/0x6` — the engine wrote
the right nibbles to the right addresses (scanout passed),
but the TB's ref looked at the wrong slot. Fix: switched to
Ch141's 9-bit case key `{yb[3:0], xb[4:0]}` and used
  Ch141's verified yb=0..7 values verbatim. **PASS** on the
  first run after the table fix.
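The DISPFB1 immediate quoted for the PSMT4 demos (LUI 0x000A; ORI 0x0400) can be decoded with a short sketch. It assumes the standard GS DISPFB field layout for the low bits (FBP in bits 8:0, FBW in bits 14:9, PSM in bits 19:15); both function names are illustrative.

```python
def mips_lui_ori(upper: int, lower: int) -> int:
    """32-bit value produced by LUI upper followed by ORI lower."""
    return ((upper & 0xFFFF) << 16) | (lower & 0xFFFF)


def decode_dispfb(value: int) -> dict:
    """Split the low DISPFB fields (assumed layout: FBP[8:0], FBW[14:9], PSM[19:15])."""
    return {
        "FBP": value & 0x1FF,
        "FBW": (value >> 9) & 0x3F,
        "PSM": (value >> 15) & 0x1F,
    }


dispfb1 = mips_lui_ori(0x000A, 0x0400)
fields = decode_dispfb(dispfb1)
```

Under that layout the immediate decodes to FBP=0, FBW=2, PSM=0x14, which agrees with the "DISPFB1 PSMT4 with FBW=2" setup described for Ch141.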
- `tb_gs_demo_psmt4_swizzle_e2e.sv` (Ch141) — first driver-shaped
end-to-end PSMT4 demo with all three PSMT4 swizzle gates
(Ch138 read-side pcrtc, Ch139 image-xfer write-side, Ch140
raster write-side) parameter-set to 1 simultaneously, but with
the demo flow exercising only the raster (Ch140) + scanout
(Ch138) paths as load-bearing. The Ch139 image-xfer gate is
smoke-only here (parameter is set but `xfer_writes_seen == 0`
is asserted, since no TRXDIR/IMAGE packet is delivered in the
raster-driven payload); the Ch139 load-bearing variant is
the Ch142 TRXDIR-driven PSMT4 e2e (mirrors Ch124/Ch130/Ch136).
  This is the PSMT4 counterpart of Ch123's PSMCT32 / Ch129's
  PSMCT16 / Ch135's PSMT8 e2e demos. Same EE-bootlet →
DMAC → GIF data plane: BIOS-resident EE program configures
GS-priv (DISPFB1 PSMT4 with FBW=2, DISPLAY1, PMODE) via `sw`
instructions → kicks DMAC ch2 → SYSCALL halts. DMAC delivers
a 24-qword payload (4 SPRITE PACKED packets) through
`gif_packed_stub` into `gs_stub` raster. The 4 sprites tile
the 16×8 active area into 4 quadrants with per-quadrant unique
RGBAQ.R[3:0] nibbles (Q0=0xA → 0xAA, Q1=0x4 → 0x44,
Q2=0xC → 0xCC, Q3=0x6 → 0x66). PSMT4 raster (Ch106) takes
RGBAQ.R[3:0] as the nibble that hits VRAM via the existing
Ch106 nibble RMW machinery (write_be=4'b0001 + write_mask
0x0F or 0xF0); Ch140 keys the high/low nibble selector off the
swizzle's `nibble_hi` output instead of `s2_pixel_index[0]`.
PCRTC's Ch103 PSMT4 grayscale fallback (clut_enable=0)
surfaces the nibble as r=g=b={n, n} at scanout, so each
captured pixel IS the nibble we wrote (no CLUT setup needed
for this demo; a CLUT-driven Ch141 variant is a future
chapter). With the raster gate on, all 128 per-pixel nibble
stores go through `gs_swizzle_psmt4_stub`; with the pcrtc
gate on, scanout reads from those same swizzled (addr,
nibble_hi) slots. **Two-phase verification**: (1) full-frame
scanout asserts each (x, y) reads back its quadrant's nibble
as PSMT4 grayscale r=g=b={n, n}; (2) per-pixel NIBBLE readback
at the canonical swizzled address (with `addr[1:0]` selecting
the right byte from the 32-bit word, then `nibble_hi`
selecting which nibble of that byte) via vram_stub's 2nd
port — the 16×8 PSMT4 image lives entirely in the upper-left
of block (0,0) of page 0 (PSMT4 block = 32×16 px) and the
within-block columnTable4 yb=0..7 / xb=0..15 exercises
nibble_idx values [0..127]. Strict linear-vs-swizzled
separator at byte 128 (linear y=2 row start at PSMT4
stride=64 with FBW=2) stays 0 — outside block (0,0)'s
touched range. Per-emit observer locks PSM=0x14, be=4'b0001,
mask ∈ {0x0F, 0xF0}. Aggregate counts: `dma=(1,24,1)
ee_dmac_wr=3 giftags=4 ad_writes=20 xfer_writes=0
ee_priv_wr=4 bridge_fires=4 core_halt=1 emits=128 frame=16x8`.
**First-attempt PASS** errors=0. Together with Ch123 (PSMCT32
e2e), Ch129 (PSMCT16 e2e), and Ch135 (PSMT8 e2e), this is the
first state of the project where the full driver-shaped flow
has end-to-end byte-accuracy demos for ALL FOUR common GS
PSMs (CT32 + CT16 + T8 + T4) under software-shaped raster
traffic. The TRXDIR-driven PSMT4 companion landed at Ch142
(mirror of Ch124/Ch130/Ch136 making Ch139 load-bearing), so
Ch141 + Ch142 together close the PSMT4 byte-accuracy
milestone end-to-end through both driver shapes — bringing
PSMT4 to full parity with CT32/CT16/T8.
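The nibble read-modify-write shape, the two-step readback selection, and the CLUT-disabled grayscale expansion described for Ch141 can be sketched as follows. The function names are hypothetical, and the sketch assumes little-endian byte lanes inside the 32-bit VRAM word (`addr[1:0] == 0` selects bits 7:0).

```python
def read_nibble(word: int, addr: int, nibble_hi: int) -> int:
    """Pick one PSMT4 nibble out of a 32-bit VRAM word."""
    byte = (word >> (8 * (addr & 0x3))) & 0xFF  # addr[1:0] selects the byte
    return (byte >> 4) & 0xF if nibble_hi else byte & 0xF


def write_nibble(byte: int, nibble: int, nibble_hi: int) -> int:
    """Nibble RMW on one byte using mask 0xF0 (high) or 0x0F (low)."""
    mask = 0xF0 if nibble_hi else 0x0F
    data = (nibble << 4) if nibble_hi else nibble
    return (byte & ~mask & 0xFF) | (data & mask)


def gray8(nibble: int) -> int:
    """CLUT-disabled PSMT4 scanout value r=g=b={n, n}."""
    return ((nibble & 0xF) << 4) | (nibble & 0xF)
```

For example, `gray8(0xA)` gives 0xAA and `gray8(0x6)` gives 0x66, the quadrant values the scanout phase expects.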
- `tb_gs_demo_psmt8_swizzle_e2e.sv` (Ch135) — first driver-shaped
end-to-end PSMT8 demo with all three PSMT8 swizzle gates
(Ch132 read-side pcrtc, Ch133 image-xfer write-side, Ch134
raster write-side) parameter-set to 1 simultaneously, but with
the demo flow exercising only the raster (Ch134) + scanout
(Ch132) paths as load-bearing. The Ch133 image-xfer gate is
smoke-only here (parameter is set but `xfer_writes_seen == 0`
is asserted, since no TRXDIR/IMAGE packet is delivered in the
raster-driven payload); the Ch133 load-bearing variant is the
  Ch136 TRXDIR-driven PSMT8 e2e (mirror of Ch124/Ch130). This is
  the PSMT8 counterpart of Ch123's PSMCT32 / Ch129's PSMCT16 e2e
  demos. Same EE-bootlet → DMAC → GIF data plane:
BIOS-resident EE program configures GS-priv (DISPFB1 PSMT8
with FBW=2, DISPLAY1, PMODE) via `sw` instructions → kicks
DMAC ch2 → SYSCALL halts. DMAC delivers a 24-qword payload
(4 SPRITE PACKED packets) through `gif_packed_stub` into
`gs_stub` raster. The 4 sprites tile the 16×8 active area into
4 quadrants with per-quadrant unique RGBAQ.R values
(Q0=0xA0, Q1=0x40, Q2=0xC0, Q3=0x60). PSMT8 raster (Ch105)
takes the natural ABGR's R channel as the byte index that
hits VRAM; PCRTC's Ch96 grayscale fallback (clut_enable=0)
surfaces the byte as R=G=B at scanout, so each captured pixel
IS the byte we wrote (no CLUT setup needed for this demo;
a CLUT-driven Ch135 variant is a future chapter). With the
raster gate on, all 128 per-pixel byte stores go through
`gs_swizzle_psmt8_stub`; with the pcrtc gate on, scanout
reads from those same swizzled addresses. **Two-phase
verification**: (1) full-frame scanout asserts each (x, y)
reads back its quadrant's byte as PSMT8 grayscale R=G=B; (2)
per-pixel BYTE readback at the canonical swizzled address
(with `addr[1:0]` selecting the right byte from the 32-bit
word) via vram_stub's 2nd port — the 16×8 PSMT8 image lives
entirely in the upper half of block (0,0) of page 0 (PSMT8
block = 16×16 px) and the within-block columnTable8 yb=0..7
exercises byte values [0..127]. Strict linear-vs-swizzled
separators at bytes 128 (linear y=1 row start at PSMT8
stride=128 with FBW=2) and 256 (linear y=2) stay 0 — both
outside block (0,0)'s touched range. Aggregate counts:
`dma=(1,24,1) ee_dmac_wr=3 giftags=4 ad_writes=20
xfer_writes=0 ee_priv_wr=4 bridge_fires=4 core_halt=1
emits=128 frame=16x8`. Together with Ch123 (PSMCT32 e2e) and
Ch129 (PSMCT16 e2e), this was the first state of the project
where the full driver-shaped flow had end-to-end byte-accuracy
demos for the CT32/CT16/T8 trio under software-shaped traffic.
  PSMT4 was the natural follow-on and landed at Ch141
  (raster-driven, mirror of this demo) + Ch142 (TRXDIR-driven, mirror
of Ch136), closing the four-PSM × dual-driver-shape e2e
matrix.
- `tb_gs_demo_psmct16_swizzle_trxdir_e2e.sv` (Ch130) — companion
  to Ch129: same EE-bootlet → DMAC → GIF data plane and same
  all-three-gates-on instantiation, but the framebuffer is filled by
a TRXDIR/IMAGE upload through `gif_image_xfer_stub` instead of
by raster. The Ch127 image-xfer write-side swizzle gate becomes
LOAD-BEARING inside the demo flow — every byte the GS produces
comes out of the image-xfer engine at canonical PSMCT16
swizzled addresses, and the raster path is dormant. Payload:
U1 (PACKED, NREG=4: BITBLTBUF{DBP=0, DBW=1, DPSM=PSMCT16} /
TRXPOS{DSAX=DSAY=0} / TRXREG{RRW=16, RRH=8} / TRXDIR{XDIR=0})
+ U2 (IMAGE, NLOOP=16: 16 IMAGE qwords carrying the 128 PSMCT16
halfwords of the same four-quadrant pattern Ch129 used). DMAC
QWC = 22. Verification mirrors Ch129: (1) full 16×8 scanout
frame capture; (2) per-pixel halfword readback at the canonical
swizzled byte address (with `addr[1]` selecting the right 16-bit
  slot) via vram_stub's 2nd read port; (3) strict
  linear-vs-swizzled separators at bytes 256 and 384 stay 0; (4) per-emit
observer asserts every image-xfer write is `be=4'b0011` /
`mask=0xFFFF_FFFF` (low halfword) and the `trxdir_wr_q` arming
pulse fires exactly once. Aggregate counts: `dma=(1,22,1)
ee_dmac_wr=3 giftags=2 ad_writes=4 trxdir_arms=1
xfer_writes=128 ee_priv_wr=4 bridge_fires=4 core_halt=1
emits=0 frame=16x8`. Ch129 + Ch130 together exercise BOTH
PSMCT16 write-side paths (raster Ch128 + image-xfer Ch127)
end-to-end through the same driver-shaped flow with the same
swizzled-scanout (Ch126) read side — bringing PSMCT16 to
full parity with the PSMCT32 e2e coverage from Ch123 + Ch124.
- `tb_gs_demo_psmct16_swizzle_e2e.sv` (Ch129) — full driver-shaped
end-to-end demo with all three PSMCT16 swizzle gates
(Ch126 read-side pcrtc, Ch127 image-xfer write-side, Ch128
raster write-side) parameter-set to 1 simultaneously, but with
the demo flow exercising only the raster (Ch128) + scanout
(Ch126) paths as load-bearing. The Ch127 image-xfer gate is
smoke-only here (parameter is set but `xfer_writes_seen == 0`
is asserted, since no TRXDIR/IMAGE packet is delivered in the
raster-driven payload); Ch130 (TRXDIR-driven PSMCT16 e2e) is
  the load-bearing image-xfer-side counterpart. This is the
  PSMCT16 counterpart of Ch123's PSMCT32 e2e demo. Same
  EE-bootlet → DMAC
→ GIF data plane: BIOS-resident EE program configures GS-priv
(DISPFB1 PSMCT16, DISPLAY1, PMODE) via `sw` instructions →
kicks DMAC ch2 → SYSCALL halts. DMAC delivers a 24-qword
payload (4 SPRITE PACKED packets) through `gif_packed_stub`
into `gs_stub` raster. The 4 sprites tile the 16×8 active area
into 4 quadrants with per-quadrant unique RGB5A1 colors picked
so the 5→8 bit-replicate at PCRTC output produces unique 8-bit
RGB triples. With the raster gate on, all 128 per-pixel
halfword stores go through `gs_swizzle_psmct16_stub`; with the
pcrtc gate on, scanout reads from those same swizzled
addresses. **Two-phase verification**: (1) full-frame scanout
asserts each (x, y) reads back its quadrant's 5→8-expanded
RGB; (2) per-pixel halfword readback via vram_stub's 2nd port
at swizzled addresses (with `addr[1]` selecting the right
16-bit slot) confirms each sprite halfword landed where the
swizzle says — the 16×8 PSMCT16 image lives entirely in block
(0,0) of page 0 (PSMCT16 block = 16×8 px), so the readback
exercises ALL 16 xb × 8 yb entries of `columnTable16`. Strict
linear-vs-swizzled separators at bytes 256 (linear y=2 row
start at PSMCT16 stride=128) and 384 (linear y=3) stay 0 —
both outside block (0,0)'s 256-byte range. Aggregate counts:
`dma=(1,24,1) ee_dmac_wr=3 giftags=4 ad_writes=20
xfer_writes=0 ee_priv_wr=4 bridge_fires=4 core_halt=1
emits=128 frame=16x8`. Together with Ch123 (PSMCT32 e2e),
this is the first state of the project where the full
driver-shaped flow has end-to-end byte-accuracy demos for
BOTH direct-color PS2 PSMs.
- `tb_gs_raster_swizzle_psmct16.sv` (Ch128) — focused contract for
the new `PSMCT16_SWIZZLE` parameter on `gs_stub` (the raster emit
surface). Mirrors Ch122's wiring shape but for PSMCT16: when the
parameter is 1 AND the active raster PSM is PSMCT16
(`ras_psm == 6'h02`), the per-pixel raster emit address is routed
  through the Ch125 `gs_swizzle_psmct16_stub` (FBP=ras_fbp,
  FBW=ras_fbw, x=s2_x_q[11:0], y=s2_y_q[11:0]); its output is the
absolute byte address. PSMCT32 is gated by its own
  `PSMCT32_SWIZZLE` parameter (Ch122). At Ch128 only, PSMT8/PSMT4
  raster emits stayed linear; Ch134 later closed the PSMT8 raster
  gate via `PSMT8_SWIZZLE` and Ch140 closed the PSMT4 raster gate
  via `PSMT4_SWIZZLE`, both on this same `gs_stub`. Default 0
  keeps every existing PSMCT16 raster TB (Ch95 etc.)
unchanged. **Three-phase verification**: (1) **origin SPRITE**
— drive a 16×4 PSMCT16 SPRITE at FRAME_1{FBP=0, FBW=1, PSMCT16}
with RGBAQ {R=0xAA, G=0x50, B=0xC0, A=0x00} → halfword 0x6155
(R5=0x15, G5=0x0A, B5=0x18, A1=0). Per-pixel halfword readback
via vram_stub's 2nd port (with `addr[1]` selecting the right
16-bit slot) confirms each lands at the swizzled byte. The
16×4 image lives in block (0,0) of page (0,0), so within-block
columnTable16 rows 0..3 are exercised. **Strict separators**:
bytes 128 (linear y=1 row start at PSMCT16 stride=128) and 256
  (linear y=2) stay 0, which proves the gate is live, since a
  fall-through to the legacy linear path would put the SPRITE
halfword there. (2) **scanout agreement** — enable the Ch126
swizzled-pcrtc path on the same VRAM contents, capture the
full 16×4 frame, assert each visible pixel reads back the
expected RGB after PCRTC's 5→8 bit-replicate (RGB = {0xAD,
0x52, 0xC6}). Both gs_stub (Ch128 raster) and gs_pcrtc_stub
(Ch126 scanout) instantiate the same swizzle module. (3)
**non-origin SPRITE** — re-arm with FRAME_1{FBP=4, FBW=2,
PSMCT16} and an 8×4 SPRITE at (60, 4)..(67, 7) with distinct
color (halfword 0x9F8E). Crosses the PAGE-x boundary at x=64
(page (0,0) for x∈[60..63] — block (0,3) by swizzle table —
vs page (1,0) for x∈[64..67] — block (0,0)) so page_index
  changes mid-row. Within-block column-table coords (xb=12..15
  then 0..3, yb=4..7) cover columnTable16 rows 4..7, a different
  region
than Phase 1's yb=0..3. Pins three contracts Phase 1 can't:
(a) `ras_fbp` reaches the swizzle's `fbp` input (FBP=0 in P1
would mask a tied-zero); (b) `ras_fbw` reaches `fbw` (FBW=1
in P1 would mask a tied-one); (c) the swizzle gets the FULL
absolute pixel coords s2_x_q/s2_y_q rather than bbox-local
(P1's sprite started at (0,0), so absolute and local were
equal). Strict P3 separator at byte 9336 (linear formula's
effective (60, 4) byte) stays 0 — outside the P3 swizzled
write set, which lives in block (0,3) of page (0,0)
(10914..11006) and block (0,0) of page (1,0) (16512..16604).
Total emit count after all phases: 64 + 32 = 96. With Ch126
(read), Ch127 (TRXDIR upload), and Ch128 (raster emit) all
live, the three major PSMCT16 paths are byte-consistent
end-to-end — completes the byte-accuracy milestone for the
second PSM, mirroring the Ch120/Ch121/Ch122 PSMCT32 closure.
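The RGB5A1 numbers used by the Ch128 phases (RGBAQ to halfword 0x6155, then the 5→8 bit-replicate at scanout) can be reproduced with a minimal sketch. The function names are hypothetical, and taking A1 from the top bit of the 8-bit alpha is an assumption; the text only shows A=0x00 giving A1=0.

```python
def pack_rgb5a1(r8: int, g8: int, b8: int, a8: int) -> int:
    """Truncate 8-bit channels to 5 bits and pack as A1:B5:G5:R5."""
    r5, g5, b5 = r8 >> 3, g8 >> 3, b8 >> 3
    a1 = a8 >> 7  # assumption: A1 is the top alpha bit
    return (a1 << 15) | (b5 << 10) | (g5 << 5) | r5


def expand5(c5: int) -> int:
    """5->8 bit-replicate: top three bits are copied into the low bits."""
    return (c5 << 3) | (c5 >> 2)


def unpack_rgb888(halfword: int) -> tuple:
    """Return (R, G, B) after bit-replicate expansion."""
    return tuple(expand5((halfword >> shift) & 0x1F) for shift in (0, 5, 10))


halfword = pack_rgb5a1(0xAA, 0x50, 0xC0, 0x00)
rgb = unpack_rgb888(halfword)
```

With the Phase 1 color this yields halfword 0x6155 and RGB = (0xAD, 0x52, 0xC6), the values the scanout-agreement phase checks.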
- `tb_gs_image_xfer_swizzle_psmct16.sv` (Ch127) — focused contract
for the new `PSMCT16_SWIZZLE` parameter on `gif_image_xfer_stub`.
Mirrors Ch121's wiring shape but for PSMCT16: when the parameter
is 1 AND the upload's PSM is PSMCT16, per-pixel byte addresses
route through the Ch125 `gs_swizzle_psmct16_stub` (FBP=0,
FBW=DBW, x=DSAX+cur_x, y=DSAY+cur_y) and `dest_base_q
(= DBP*256)` is added back to anchor at the upload's DBP base.
  PSMCT32 is gated by its own `PSMCT32_SWIZZLE` parameter (Ch121).
  At Ch127 only, PSMT8/PSMT4 uploads stayed linear; Ch133 and
  Ch139 later added their own gates. Default 0 keeps every
  existing PSMCT16
image-xfer TB unchanged. **Three-phase verification**: (1)
**origin transfer** — TRXDIR upload of a 16×4 PSMCT16 image at
DBP=DSAX=DSAY=0, DBW=1, RRW=16, RRH=4 → 64 pixels, 8 IMAGE
qwords (8 px/qword for PSMCT16). After upload, the TB reads
vram_stub's 2nd port at the SWIZZLED byte address (TB-side
`ref_addr16/ref_block_idx16/ref_col_idx16` carry the verbatim
PCSX2 tables locked at Ch125) and asserts each halfword landed
where the swizzle says (selecting the right 16-bit slot inside
the 32-bit word via `addr[1]`). Strict linear-vs-swizzled
separators at bytes 128 (linear y=1) and 256 (linear y=2) stay
0 — swizzled writes for the 16×4 image fill only block (0,0)
bytes [0..126]. (2) **scanout agreement** — enable the Ch126
swizzled-pcrtc path on the same VRAM contents, capture the
full 16×4 frame, assert each scanned pixel matches the uploaded
RGB5A1 → RGB888 5→8 bit-replicate. Both upload and scanout
instantiate the same `gs_swizzle_psmct16_stub`. (3) **non-origin
transfer** — re-arm with DBP=8, DSAX=12, DSAY=6, RRW=8, RRH=4.
Effective coords (12..19, 6..9) cross block_x=0→1 at
effective_x=16 AND block_y=0→1 at effective_y=8, exercising
both block-table dimensions inside a single non-origin upload.
Pins three contracts the origin transfer can't distinguish from
a buggy implementation: (a) `dest_base_q (= DBP*256)` is added
on top of the swizzle output (DBP=0 in P1 would mask a
missing-add); (b) the swizzle is fed the FULL effective coords
(DSAX=DSAY=0 in P1 would mask a "feeds only cur_x/cur_y"
regression); (c) BOTH block_x and block_y propagate through
`blockTable16[by][bx]` (block_x=0 throughout P1 would mask a
tied-block_x regression). Strict P3 separator at byte 3096
(linear formula's effective (12, 8) byte) stays 0 — outside
the P3 swizzled write set [2048..3071]. NOTE (now historical):
PSMCT16 raster swizzle was deferred when Ch127 landed; it
shipped at Ch128 (mirrors Ch122 for PSMCT32) so the PSMCT16
raster path is now byte-consistent with the image-xfer path
documented here.
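Why the Ch127 non-origin transfer exercises both block-table dimensions can be seen by enumerating which blocks its effective coordinates touch. This is a sketch with a hypothetical helper; the linear-separator line assumes Phase 3 keeps DBW=1 from Phase 1, which is what the quoted byte 3096 implies.

```python
def visited_blocks(x0: int, y0: int, w: int, h: int,
                   block_w: int, block_h: int) -> list:
    """Sorted (block_x, block_y) pairs touched by a w x h rectangle at (x0, y0)."""
    return sorted({(x // block_w, y // block_h)
                   for y in range(y0, y0 + h)
                   for x in range(x0, x0 + w)})


# Phase 1: 16x4 at the origin; Phase 3: DSAX=12, DSAY=6, RRW=8, RRH=4.
# PSMCT16 block geometry is 16x8 px.
p1_blocks = visited_blocks(0, 0, 16, 4, 16, 8)
p3_blocks = visited_blocks(12, 6, 8, 4, 16, 8)

# Linear fallback address of effective pixel (12, 8):
# DBP*256 + y * (DBW*64 px * 2 bytes) + x * 2 bytes, with DBW=1 assumed.
p3_linear_separator = 8 * 256 + 8 * (1 * 64 * 2) + 12 * 2
```

Phase 1 stays in a single block, while Phase 3 touches all four combinations of block_x and block_y in {0, 1}, so a tied-off block_x or block_y input would be caught only by Phase 3.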
- `tb_gs_raster_swizzle_psmt4.sv` (Ch140) — focused contract for
the new `PSMT4_SWIZZLE` parameter on `gs_stub` (the raster emit
surface). Mirrors Ch122/Ch128/Ch134 wiring shape but for the
fourth (and last) PSM, and threads the Ch137 swizzle module's
`nibble_hi` output into the existing Ch106 PSMT4 raster nibble
RMW data lane (replacing `s2_pixel_index[0]` as the high/low
nibble selector when the gate is on). When the parameter is 1
AND the active raster PSM is PSMT4 (`ras_psm == 6'h14`), the
per-pixel raster emit address is routed through the Ch137
`gs_swizzle_psmt4_stub` (FBP=ras_fbp, FBW=ras_fbw,
x=s2_x_q[11:0], y=s2_y_q[11:0]) — its `addr` output is the
absolute byte address, AND its `nibble_hi` output keys
`s2_emit_color64`'s nibble placement and `s2_emit_mask`'s
high/low gating (write_be stays 4'b0001 for both paths).
PSMCT32/PSMCT16/PSMT8 are gated by their own parameters;
default 0 keeps every existing PSMT4 raster TB (Ch106
  raster_psmt4, Ch107 PSMT4-e2e, Ch103 PSMT4+CLUT, Ch104
  round-trip, etc.) on the original linear addressing. No new ports.
Default-off smoke verification: ran Ch106 + Ch107 + Ch103 +
Ch104 PSMT4 TBs before writing the new TB; all PASSed
unchanged. **Three-phase verification** (mirrors Ch134 PSMT8
raster shape, with PSMT4 nibble adaptations + CLUT-disabled
grayscale at scanout):
(1) **origin SPRITE** at FBP=0/FBW=2 (FBW must be even per
PCSX2 GSLocalMemory.h:560 — same as PSMT8). Drive a 16×4 PSMT4
SPRITE with RGBAQ.R=0xAA (PSMT4 raster channel takes R[3:0] as
the nibble per Ch106 → nibble = 0xA). Per-pixel nibble readback
via vram_stub's 2nd port (with `addr[1:0]`-keyed byte
selection then `nibble_hi`-keyed nibble selection inside the
byte) confirms each pixel landed at the correct (byte, nibble)
slot. The image lives in the upper-left of block (0,0) of page
(0,0); within-block columnTable4 entries for yb=0..3, xb=0..15
cover nibble_idx values [0..127] → byte_in_block ∈ [0..63].
Strict separator: byte 64 (linear y=1 row start at PSMT4
FBW=2 stride 64) stays 0.
(2) **scanout agreement** — enable Ch138 swizzled-pcrtc on
the same VRAM, capture full 16×4 frame, assert each pixel
reads back as PSMT4 grayscale R=G=B={0xA, 0xA} = 0xAA. Both
gs_stub and gs_pcrtc_stub instantiate the same
`gs_swizzle_psmt4_stub` AND thread its `nibble_hi` output
through their respective nibble selectors — agreement at this
layer means both integrations land at the same byte+nibble
positions for PSMT4.
(3) **non-origin SPRITE** at FBP=4/FBW=4 (bw_pg=2) drawing
8×4 SPRITE at (124, 4)..(131, 7) with R=0x55 (nibble = 0x5).
Crosses PSMT4 PAGE-x at x=128 (page (0,0) for x∈[124..127],
page (1,0) for x∈[128..131]). 2 blocks visited:
blockTable4[0][3]=10 → page (0,0) block_base 10752;
blockTable4[0][0]=0 → page (1,0) block_base 16384. Pins three
contracts the origin transfer can't: ras_fbp reaches the
swizzle's fbp input; ras_fbw reaches fbw; the swizzle gets
the FULL absolute pixel coords s2_x_q/s2_y_q. Strict P3
separator at byte 8766 (linear (124, 4) at FBP=4/FBW=4) stays
0 — outside the P3 swizzled write set [10752..11007] +
  [16384..16639]. Total emit count: 64 + 32 = 96.
  **First-attempt PASS** errors=0.
With Ch138 (read-side), Ch139 (TRXDIR upload), and Ch140
(raster emit) all live, the three major PSMT4 paths can be
byte-consistent under the canonical swizzle when their gates
are flipped on — completing the **four-PSM × three-path
byte-accuracy foundation** (CT32 Ch120/Ch121/Ch122 + CT16
  Ch126/Ch127/Ch128 + T8 Ch132/Ch133/Ch134 + T4
  Ch138/Ch139/Ch140). End-to-end PSMT4 swizzled demos (mirroring
  Ch123/Ch124, Ch129/Ch130, Ch135/Ch136) followed at Ch141
  (raster-driven) and Ch142 (TRXDIR-driven).
- `tb_gs_raster_swizzle_psmt8.sv` (Ch134) — focused contract for
the new `PSMT8_SWIZZLE` parameter on `gs_stub` (the raster emit
surface). Mirrors Ch122's PSMCT32 + Ch128's PSMCT16 wiring shape
but for the third PSM: when the parameter is 1 AND the active
raster PSM is PSMT8 (`ras_psm == 6'h13`), the per-pixel raster
emit address is routed through the Ch131 `gs_swizzle_psmt8_stub`
(FBP=ras_fbp, FBW=ras_fbw, x=s2_x_q[11:0], y=s2_y_q[11:0]) —
  its output is the absolute byte address. PSMCT32/PSMCT16 are
  gated by their own parameters; at Ch134 PSMT4 raster stayed
  linear (Ch140 later added `PSMT4_SWIZZLE`). Default 0
keeps every existing PSMT8 raster TB (Ch105 raster_psmt8, Ch107
PSMT4-via-CT16-CLUT palette path, etc.) on the original linear
  addressing. No new ports (parameter-only API change).
  Default-off smoke verification: ran Ch105 `tb_gs_raster_psmt8` before
writing the new TB; PASSed unchanged. **Three-phase verification**
(mirrors Ch128 PSMCT16 raster shape):
  (1) **origin SPRITE** at FBP=0/FBW=2 (FBW must be even: PCSX2
  asserts `(bw & 1) == 0` for PSMT8). Drive a 16×8 PSMT8 SPRITE
with RGBAQ.R=0xA5 (PSMT8 raster channel uses R as the byte
index per Ch105). Per-pixel byte readback via vram_stub's 2nd
port confirms each lands at the swizzled byte. The 16×8 image
lives in the upper half of block (0,0) of page (0,0); the
within-block columnTable8 distributes the 128 bytes across yb
rows 0..7 — byte values 0..127 within the block. **Strict
separators**: bytes 128 (linear y=1 row start at PSMT8
stride=128) and 256 (linear y=2) stay 0 — proves the gate is
live, since a fall-through to the legacy linear path would put
the SPRITE byte there. (2) **scanout agreement** — enable the
Ch132 swizzled-pcrtc path on the same VRAM, capture the full
16×8 frame, assert each pixel's PCRTC PSMT8 grayscale R=G=B
matches `idx=0xA5`. Both gs_stub and gs_pcrtc_stub instantiate
the same `gs_swizzle_psmt8_stub`, so success proves byte-level
agreement. (3) **non-origin SPRITE** at FBP=4/FBW=4 (bw_pg=2)
drawing 8×4 SPRITE at (124, 4)..(131, 7) with RGBAQ.R=0x5A.
Crosses PSMT8 PAGE-x at x=128 (x∈[124..127] is in page (0,0)
block (0,7) by swizzle table; x∈[128..131] is in page (1,0)
block (0,0)) so page_index changes mid-row. Pins three
contracts the origin transfer can't: `ras_fbp` reaches the
swizzle's fbp input (FBP=0 in P1 would mask a tied-zero);
  `ras_fbw` reaches fbw (FBW=2 in P1 would mask a tied-two); the
swizzle gets the FULL absolute pixel coords s2_x_q/s2_y_q
rather than bbox-local (P1 sprite started at (0,0) so
absolute=local). PSMT8's page-x boundary at x=128 is different
from CT32/CT16's x=64, so this exercises the PSMT8-specific
x[7] wiring of the swizzle. Strict P3 separator at byte 9340
(linear (124, 4) at FBP=4/FBW=4) stays 0 — outside the P3
swizzled write set (page (0,0) block (0,7) at base 13568, page
(1,0) block (0,0) at base 16384). Total emit count: 128 + 32 =
160. **First-attempt PASS** errors=0. With Ch132 (read-side),
Ch133 (TRXDIR upload), and Ch134 (raster emit) all live, the
three major PSMT8 paths can be byte-consistent under the
canonical swizzle when their gates are flipped on — completing
the third-PSM byte-accuracy milestone for ALL three integration
points (mirrors the Ch120/Ch121/Ch122 PSMCT32 trio + the
Ch126/Ch127/Ch128 PSMCT16 trio).
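The strict-separator byte addresses quoted for the raster TBs come from the legacy linear formula, which can be sketched as below. The frame base of `FBP * 2048` bytes is an assumption inferred from the block bases quoted in this document (FBP=4 places page (0,0) at byte 8192); it is not taken from a hardware manual, and the helper name is illustrative.

```python
def linear_byte_addr(fbp: int, fbw: int, x: int, y: int,
                     bits_per_pixel: int) -> int:
    """Byte address of pixel (x, y) under the legacy linear layout."""
    base = fbp * 2048                          # assumed frame base in bytes
    stride = fbw * 64 * bits_per_pixel // 8    # FBW counts 64-pixel units
    return base + y * stride + x * bits_per_pixel // 8


psmt8_p3 = linear_byte_addr(4, 4, 124, 4, 8)    # Ch134 Phase 3 separator
psmct16_p3 = linear_byte_addr(4, 2, 60, 4, 16)  # Ch128 Phase 3 separator
psmt4_p3 = linear_byte_addr(4, 4, 124, 4, 4)    # Ch140 Phase 3 separator
```

These evaluate to 9340, 9336, and 8766 respectively. Each value lies outside the swizzled write set described for its phase, which is why a nonzero byte at that address signals a fall-through to the linear path.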
- `tb_gs_image_xfer_swizzle_psmt4.sv` (Ch139) — focused contract
for the new `PSMT4_SWIZZLE` parameter on `gif_image_xfer_stub`.
Mirrors Ch121/Ch127/Ch133 wiring shape but for the fourth (and
last) PSM, and threads the Ch137 swizzle module's `nibble_hi`
output into the existing Ch118 nibble RMW data lane (replacing
`x_eff[0]` as the high/low nibble selector when the gate is
on). When the parameter is 1 AND the active DPSM is PSMT4, the
per-pixel byte address is `dest_base_q (= DBP*256) +
swizzle_psmt4(FBP=0, FBW=DBW, x=DSAX+cur_x, y=DSAY+cur_y).addr`,
AND `cur_mask_c` is `0x0000_00F0` when `swizzle4_nibble_hi=1`
(high nibble) or `0x0000_000F` when 0 (low nibble) — the
per-bit write_mask machinery (vram_stub merges only the
targeted nibble) layers on top of the swizzled address. PSMCT32
/PSMCT16/PSMT8 are gated by their own parameters. Default 0
  keeps the legacy linear path for every existing PSMT4
  image-xfer TB (Ch118 etc.). No new ports (parameter-only API
  change). Default-off smoke verification: ran Ch118
`tb_gs_image_xfer_psmt4` before writing the new TB; PASSed
unchanged. **Three-phase verification** (mirrors Ch127/Ch133
  audit-closed shape): (1) **origin write-side lock** at
  DBP=0/DBW=2/DSAX=DSAY=0 (DBW must be even per PCSX2
  GSLocalMemory.h:560, the same evenness rule as PSMT8). 16×4
  PSMT4 image upload via 2 IMAGE qwords (32 px/qword for PSMT4, so
  each qword carries 2 rows of 16 px at RRW=16 and the 2 qwords
  cover all 4 rows). After upload, per-pixel nibble readback at the
swizzled `(addr, nibble_hi)` slot asserts each nibble landed
where the swizzle says. Strict separator: PSMT4 row stride at
DBW=2 = DBW*32 = 64 bytes, so linear y=1 starts at byte 64.
Swizzled write set lives in [0..63] within block (0,0). Byte
64 stays 0 (verified via per-byte check, not full-word — the
`check_byte_zero` task initially had a full-word bug that
misreported neighbor-byte writes; fixed to check only the
targeted byte via `addr[1:0]`-keyed case statement).
(2) **end-to-end agreement**: enable Ch138 PSMT4 swizzled
scanout on the same VRAM (PSMT4_SWIZZLE=1 on pcrtc, CLUT
disabled), capture the 16×4 frame, verify each pixel's grayscale
R=G=B={nibble, nibble} matches `nibble_at(xx, yy)`. Both
modules instantiate the same `gs_swizzle_psmt4_stub` so success
proves byte+nibble-level agreement under TRXDIR-style emit +
scanout-style read.
  (3) **non-origin transfer** at
  DBP=8/DBW=2/DSAX=28/DSAY=12/RRW=8/RRH=8. Effective coords
  (28..35, 12..19) cross block_x=0→1 at effective_x=32 AND
  block_y=0→1 at effective_y=16 (PSMT4
block geometry: 32×16 px). All 4 corner blocks of page (0,0)
at DBP=8 visited: blockTable4[0][0]=0, [0][1]=2, [1][0]=1,
[1][1]=3 (block bases 2048/2560/2304/2816). Pins three
contracts the origin transfer can't: dest_base_q ADDED ON TOP
of the swizzle output (DBP=0 in P1 would mask a missing-add
regression — fixed during bring-up after the TB initially
passed P3_DBP directly to ref_pos_psmt4 instead of using
fbp_v=0 + adding DBP*256); FULL effective coords; BOTH
block_x and block_y propagate through `blockTable4[by][bx]`.
Phase 3 strict separator: linear formula puts effective coord
(28, 12) at byte 2830 — under linear, the neighboring pixel
(29, 12) writes high nibble = 1 to that byte. Under swizzled,
no Phase-3 pixel hits byte 2830 (cross-checked: col_idx_psmt4
for the 4-block × 16-pixel coord set never produces nibble_idx
28 or 29). Byte 2830 stays 0 → fall-through to linear would
have stomped it with 0x10. **PASS** errors=0 after two bug-fix
iterations: (a) ref_pos_psmt4(P3_DBP, ...) was wrong — engine
feeds FBP=0 to the swizzle and adds DBP*256 separately, so TB
must do the same; (b) check_byte_zero tested the full word
instead of the targeted byte, producing false failures when a
neighbor byte in the same word was independently touched.
Counts: arms=2, writes=128 (P1 64 + P3 64). With Ch138 (read-
side scanout) + Ch139 (image-xfer write-side) + Ch140 (raster
write-side) all live, the Ch137 PSMT4 primitive now has all 3
integration points wired, and Ch141 closes the e2e demo.
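The Ch139 Phase 3 block bases and the byte-2830 separator can be cross-checked with a few lines of arithmetic. The sketch uses only the four `blockTable4` entries quoted above, keyed by (block_y, block_x), and assumes 256-byte blocks; the helper name is hypothetical.

```python
BLOCK_BYTES = 256  # assumed size of one GS block in bytes

# blockTable4 entries quoted in the text, keyed by (block_y, block_x).
block_table4 = {(0, 0): 0, (0, 1): 2, (1, 0): 1, (1, 1): 3}


def block_base(dbp: int, by: int, bx: int) -> int:
    """dest_base (= DBP*256) plus the swizzled block offset."""
    return dbp * 256 + block_table4[(by, bx)] * BLOCK_BYTES


bases = [block_base(8, by, bx) for by in (0, 1) for bx in (0, 1)]

# Linear fallback address of effective pixel (28, 12): stride = DBW*32 bytes.
linear_byte = 8 * 256 + 12 * (2 * 32) + 28 // 2

# Position of that byte inside the block based at 2816, as nibble indices.
offset = linear_byte - block_base(8, 1, 1)
nibbles = (2 * offset, 2 * offset + 1)
```

The bases come out as 2048/2560/2304/2816, the linear byte as 2830, and the two nibble indices at that byte as 28 and 29, which are the indices the text says the Phase 3 coordinate set never produces.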
- `tb_gs_image_xfer_swizzle_psmt8.sv` (Ch133) — focused contract
for the new `PSMT8_SWIZZLE` parameter on `gif_image_xfer_stub`.
Mirrors Ch121's PSMCT32 + Ch127's PSMCT16 wiring shape but for
the third PSM: when the parameter is 1 AND the active DPSM is
PSMT8, the per-pixel byte address is `dest_base_q (= DBP*256) +
swizzle_psmt8(FBP=0, FBW=DBW, x=DSAX+cur_x, y=DSAY+cur_y)`.
  PSMCT32/PSMCT16 are gated by their own parameters; at Ch133
  PSMT4 stayed linear (its swizzle gate landed later at Ch139).
  Default 0 keeps the legacy
linear path for every existing PSMT8 image-xfer TB (Ch117 etc.).
No new ports — parameter-only API change. Default-off smoke
verification: ran Ch117 `tb_gs_image_xfer_psmt8` before writing
the new TB; PASSed unchanged. **Three-phase verification**
(mirrors Ch127 audit-closed shape):
(1) **origin write-side lock** at DBP=0/DBW=2 (DBW must be even
per PCSX2 GSLocalMemory.h:553 — PSMT8 pages are 128 px wide vs
FBW's 64-px units, so 2 FBW units per page → bw_pg=1 here).
16×8 PSMT8 image upload via 8 IMAGE qwords (16 px/qword). Per-
pixel index `idx_at(x, y) = (y[2:0] << 4) | x[3:0]`
∈ [0x00..0x7F]. After upload, byte-readback at the swizzled
address asserts each byte landed where the swizzle says. Strict
separators: linear y=1 (byte 128) and y=2 (byte 256) row starts
stay 0 — swizzled write set lives entirely in [0..127].
(2) **end-to-end agreement**: enable Ch132 swizzled scanout on
the same VRAM, capture the frame, verify each visible pixel's
PCRTC PSMT8 grayscale R=G=B matches `idx_at(x, y)`. Both modules
instantiate the same `gs_swizzle_psmt8_stub` so success proves
byte-level agreement under TRXDIR-style emit + scanout-style
read. (3) **non-origin transfer** at DBP=8/DBW=2/DSAX=12/DSAY=10/
RRW=8/RRH=8. Effective coords (12..19, 10..17) cross block_x=0→1
at effective_x=16 AND block_y=0→1 at effective_y=16, so all 4
corner blocks of page (0,0) at DBP=8 (blockTable8[0][0]=0,
[0][1]=1, [1][0]=2, [1][1]=3 → block bases 2048/2304/2560/2816)
are visited. Pins three contracts the origin transfer can't:
`dest_base_q = DBP*256` ADDED ON TOP; the swizzle is fed FULL
effective coords (DSAX/DSAY non-zero); BOTH block_x and block_y
propagate through `blockTable8[by][bx]`. Phase 3 distinct-pixel
pattern uses `p3_idx = 0x80 | idx` ∈ [0x80..0xFF] (disjoint
from Phase 1's [0x00..0x7F]) so a P3 pixel landing at a P1
byte (or vice versa) surfaces as wrong RGB. Phase 3 strict
separator: linear formula puts effective coord (12, 10) at
byte `2048 + 10*128 + 12 = 3340` (outside swizzled set
[2048..3071]); byte 3340 stays 0 — proves a fall-through to
linear would have stomped that byte. **First-attempt PASS**:
arms=2, writes=192 (=128+64), errors=0. NOTE: at Ch133 only,
PSMT8 raster-side emits via `gs_stub` still used linear
addressing — Ch133 was image-xfer write-side only. Ch134 later
closed the raster-side gate via `PSMT8_SWIZZLE` on `gs_stub`
(mirrors Ch122 for PSMCT32 and Ch128 for PSMCT16) — see Ch134
row above.
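A block-level Python sketch of the Phase 3 arithmetic above, offered as a reading aid rather than the RTL. `psmt8_block_base` and `linear_psmt8` are hypothetical helper names. `BLOCK_TABLE8` is written out by hand and should be checked against the PCSX2 source, although it agrees with every `blockTable8` entry quoted in this document. The within-block column table is deliberately omitted.

```python
# Block-level model of the PSMT8 image-xfer address (assumption: mirrors the
# "dest_base_q + swizzle(FBP=0, ...)" split described above, not the RTL).
BLOCK_TABLE8 = [
    [0, 1, 4, 5, 16, 17, 20, 21],
    [2, 3, 6, 7, 18, 19, 22, 23],
    [8, 9, 12, 13, 24, 25, 28, 29],
    [10, 11, 14, 15, 26, 27, 30, 31],
]

def psmt8_block_base(dbp, dbw, x, y):
    """Byte address of the 256-byte block holding effective pixel (x, y)."""
    assert dbw % 2 == 0, "PSMT8 needs an even buffer width"
    bw_pg = dbw >> 1                          # 128-px pages per row
    page = (y // 64) * bw_pg + (x // 128)
    block = BLOCK_TABLE8[(y % 64) // 16][(x % 128) // 16]
    return dbp * 256 + page * 8192 + block * 256

def linear_psmt8(dbp, dbw, x, y):
    """Legacy linear byte address (1 byte per pixel)."""
    return dbp * 256 + y * dbw * 64 + x

# Phase 3: DBP=8, DBW=2, DSAX=12, DSAY=10, RRW=RRH=8.
bases = sorted({psmt8_block_base(8, 2, 12 + cx, 10 + cy)
                for cy in range(8) for cx in range(8)})
separator = linear_psmt8(8, 2, 12, 10)
print(bases)      # [2048, 2304, 2560, 2816]
print(separator)  # 3340
```

The separator byte falls above the last swizzled block (which ends at 3071), which is why a fall-through to linear addressing would be visible there.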
- `tb_gs_scanout_swizzle_psmt4.sv` (Ch138) — focused contract for
the new `PSMT4_SWIZZLE` parameter on `gs_pcrtc_stub`. Mirrors
Ch120/Ch126/Ch132's read-side-first wiring shape but adds the
PSMT4-specific twist: the swizzle module outputs both an
absolute byte address AND a `nibble_hi` selector (PSMT4 = 4
bits/pixel = 2 pixels per byte, and the canonical PCSX2 column
table reorders nibbles within a block, so `pixel_index[0]`
is no longer the right selector under the swizzled layout).
When the parameter is 1 AND the active PSM is PSMT4, scanout
reads go through the Ch137 `gs_swizzle_psmt4_stub` and the
PSMT4 nibble extractor uses `swizzle4_nibble_hi` instead of
`pixel_index[0]`. PSMCT32/PSMCT16/PSMT8 are gated by their own
parameters; default 0 keeps every existing PSMT4 scanout TB
(Ch103 PSMT4+CLUT, Ch104 PSMT4 round-trip, Ch107 PSMT4 e2e,
etc.) on the legacy linear path. No new ports — parameter-
only API change. Default-off smoke verification: ran Ch103
`tb_gs_scanout_psmt4_clut` + Ch104 `tb_gs_psmt4_round_trip`
before writing the new TB; both PASSed unchanged. **Two-phase
verification** (mirrors Ch132 closure shape; CLUT disabled so
PCRTC's PSMT4 grayscale fallback gives `r=g=b={nibble,
nibble}` at scanout):
(1) **origin** at FBP=0/FBW=2/DBX=DBY=0 (FBW must be even per
PCSX2 GSLocalMemory.h:560 because PSMT4 pages are 128 px wide,
same as PSMT8). 16×4 region preloaded at swizzled bytes via a
TB-side `byte_shadow` accumulator that lays each pixel's
nibble at its `(addr, nibble_hi)` slot; bytes are then flushed
to vram_stub via per-byte BE writes. Per-pixel nibble pattern
`nibble_at(x, y) = ((y << 1) ^ x) & 4'h7` ∈ [0..7] gives eight
distinct gray values that vary along both x and y across the
16×4 frame. The image lives entirely
in block (0,0) of page (0,0) and exercises within-block
columnTable4 entries for yb=0..3, xb=0..15. Strict separator:
byte 64 (linear y=1 row start at FBW=2 stride) pre-colored
with sentinel 0xCC (gray=0xCC, unproducible by Phase 1's
[0..7]-nibble pattern) — fall-through to linear would surface
as RGB(0xCC, 0xCC, 0xCC).
(2) **non-origin** at FBP=4/FBW=4 (bw_pg=2), DBX=120, DBY=126.
Effective coords range x∈[120..135], y∈[126..129]. page_x
crosses 0→1 at effective_x=128, page_y crosses 0→1 at
effective_y=128 (PSMT4's 128-tall page boundary — different
from PSMT8's 64-tall). All 4 corner pages of FBP=4/FBW=4
visited, each with a distinct blockTable4 lookup
(blockTable4[7][3]=31 → page (0,0) block_base 16128;
blockTable4[7][0]=21 → page (1,0) block_base 21760;
blockTable4[0][3]=10 → page (0,1) block_base 27136;
blockTable4[0][0]=0 → page (1,1) block_base 32768). A
regression that tied any of {dispfb_fbp, dbx, dby, FBW,
block_x, block_y, page_index, bw_pg=FBW/2, swizzle
nibble_hi} to zero would NOT survive Phase 2. Strict P2
separator: byte 24380 (linear formula's place for (120, 126);
outside all 4 swizzled chunks) pre-colored with sentinel 0xDD
→ fall-through to linear would surface as RGB(0xDD, 0xDD,
0xDD), unproducible by the Phase-2 pattern. **PASS** errors=0
after one bug-fix iteration: Phase 2's flush-loop initially
hardcoded the wrong byte ranges due to a `blockTable4[7][3]`
lookup mistake (the value is 31, not 15) — replaced with a
shadow-array sweep [256..65535] that flushes any non-zero
byte, eliminating the hardcode/lookup mismatch class entirely.
NOTE (now historical): Ch138 was read-side only when
introduced; the PSMT4 write-side is now live as well — Ch139
(image-xfer) + Ch140 (raster) + Ch141 (raster-driven e2e
demo). With Ch138, **all four common GS PSMs now have read-
side byte-accuracy under their swizzle gates** (CT32 Ch120 +
CT16 Ch126 + T8 Ch132 + T4 Ch138).
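How the `byte_shadow` accumulator and the grayscale fallback fit together, as a minimal Python sketch. The helper names are hypothetical, and the `(addr, nibble_hi)` pairs are plain inputs here; in the TB they come from the swizzle reference.

```python
def nibble_at(x, y):
    """Phase 1 pattern from the text: ((y << 1) ^ x) & 7."""
    return ((y << 1) ^ x) & 0x7

def place_nibble(shadow, addr, nibble_hi, nibble):
    """Merge one 4-bit pixel into the shadow byte, keeping the other nibble."""
    old = shadow.get(addr, 0)
    if nibble_hi:
        shadow[addr] = (old & 0x0F) | ((nibble & 0xF) << 4)
    else:
        shadow[addr] = (old & 0xF0) | (nibble & 0xF)

def extract_nibble(byte, nibble_hi):
    """Scanout-side selection: nibble_hi picks the half, not pixel_index[0]."""
    return (byte >> 4) & 0xF if nibble_hi else byte & 0xF

def gray(nibble):
    """Grayscale fallback r = g = b = {nibble, nibble}."""
    return ((nibble & 0xF) << 4) | (nibble & 0xF)

shadow = {}
place_nibble(shadow, 0x10, 0, 0x3)   # low nibble first
place_nibble(shadow, 0x10, 1, 0x6)   # then the high nibble of the same byte
print(hex(shadow[0x10]))             # 0x63

grays = {gray(nibble_at(x, y)) for y in range(4) for x in range(16)}
print(hex(max(grays)))               # 0x77
```

Because the pattern never exceeds nibble 7, no gray value above 0x77 can appear, so the 0xCC and 0xDD sentinels stay distinguishable.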
- `tb_gs_scanout_swizzle_psmt8.sv` (Ch132) — focused contract for
the new `PSMT8_SWIZZLE` parameter on `gs_pcrtc_stub`. Mirrors
Ch120/Ch126's wiring shape but for PSMT8: when the parameter is
1 AND the active PSM is PSMT8, scanout reads go through the
Ch131 `gs_swizzle_psmt8_stub` (real PS2 GS page/block/column
layout — 128×64 pixel pages, 4×8 block grid, 16×16 within-block
bytes, `bw_pg = FBW>>1`) instead of the legacy linear
`FBW*64*y + x` formula. PSMCT32/PSMCT16 are gated by their own
parameters; PSMT4 stays linear (its swizzle math is future).
Default PSMT8_SWIZZLE=0 keeps every existing PSMT8 scanout TB
(Ch96 storage-only, Ch97 PSMT8+CLUT, Ch103 PSMT4-via-CT16-CLUT,
Ch107 PSMT4-e2e palette path) on the original linear addressing.
No new ports — parameter-only API change. Default-off smoke
verification: ran Ch96 `tb_gs_scanout_psmt8` before writing the
new TB; PASSed unchanged, confirming the new instance + 4-way
mux extension don't disturb the linear path. **Two-phase
verification** (mirrors Ch126 PSMCT16 closure shape):
(1) **origin** (FBP=0, FBW=2, DBX=DBY=0; FBW must be even —
PCSX2 asserts `(bw & 1) == 0` for PSMT8 because pages are 128 px
wide vs FBW's 64-px units, so 2 FBW units per page → bw_pg=1
here). 16×8 region preloaded at swizzled bytes; per-pixel index
`idx = (y[2:0] << 4) | x[3:0]` ∈ [0x00..0x7F] surfaces as
grayscale R=G=B=idx via PCRTC's PSMT8 fallback (Ch96). x∈[0..15]
is entirely block_x_in_page=0, so the within-block test
exercises ALL 16 xb positions of `columnTable8` across yb rows
0..7. Strict separators: linear y=1 starts at byte 128 (FBW=2
stride) but swizzled lands at byte 8 (`columnTable8[1][0]=8`,
no `*2` scale since PSMT8 is 1 byte/pixel); linear x=8,y=0 is
byte 8 but swizzled is byte 2. (2) **non-origin** (FBP=4,
FBW=4 → bw_pg=2, DBX=120, DBY=60). Effective coords range
x∈[120..135], y∈[60..67] — page_x crosses 0→1 at effective_x=128
(proves x[7] reaches the page-x lane of the PSMT8 swizzle —
different boundary from CT16/CT32's x[6]); page_y crosses 0→1
at effective_y=64; block_x and block_y both flip; ALL 4 pages
(0,0)/(1,0)/(0,1)/(1,1) are visited, each with a distinct
blockTable8 lookup ([3][7]=31, [3][0]=10, [0][7]=21, [0][0]=0).
A regression that tied any of {dispfb_fbp, dbx, dby, FBW,
block_x, block_y, page_index, bw_pg=FBW/2} to zero would NOT
survive Phase 2. **Sentinel separator**: byte 24500 (inside
linear range 23672..25479 for the Phase-2 effective region,
outside ALL 4 swizzled write-set blocks) pre-colored with 0xFF
→ fall-through to linear would surface as RGB(0xFF, 0xFF, 0xFF),
which is unproducible by the Phase-2 unique pattern (idx ∈
[0x00..0x7F]). **First-attempt PASS** errors=0 — no audit
iteration needed because Phase 2's coord choices were designed
up front to make all 7 chain-layer wires load-bearing AND the
page-x crossing boundary is at PSMT8's specific x=128 (not the
64-px boundary the direct-color PSMs use). NOTE (now historical):
Ch132 was read-side only when introduced; Ch133 then Ch134
closed the image-xfer + raster write sides for PSMT8, so all
three PSMT8 swizzle integration points are now live (mirrors
Ch120/Ch121/Ch122 for PSMCT32 and Ch126/Ch127/Ch128 for PSMCT16).
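The Phase 2 sentinel placement for the PSMT8 scanout TB can be re-derived with a short Python sketch. It uses only the four `blockTable8` entries quoted above, so it is not a full swizzle model; `block_base` and `linear` are hypothetical helper names.

```python
# blockTable8[block_y][block_x] entries quoted in the text for Phase 2.
QUOTED_BLOCKS = {(3, 7): 31, (3, 0): 10, (0, 7): 21, (0, 0): 0}

FBP, FBW, DBX, DBY = 4, 4, 120, 60
BW_PG = FBW >> 1          # PSMT8 pages are 128 px wide
SENTINEL = 24500

def block_base(x, y):
    """Base byte address of the 256-byte block holding pixel (x, y)."""
    page = (y // 64) * BW_PG + (x // 128)
    block = QUOTED_BLOCKS[((y % 64) // 16, (x % 128) // 16)]
    return FBP * 2048 + page * 8192 + block * 256

def linear(x, y):
    """Legacy linear PSMT8 byte address."""
    return FBP * 2048 + y * FBW * 64 + x

bases = sorted({block_base(DBX + cx, DBY + cy)
                for cy in range(8) for cx in range(16)})
lin_lo, lin_hi = linear(120, 60), linear(135, 67)
in_swizzled = any(b <= SENTINEL < b + 256 for b in bases)

print(bases)            # [16128, 18944, 29952, 32768]
print(lin_lo, lin_hi)   # 23672 25479
print(in_swizzled)      # False
```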
- `tb_gs_scanout_swizzle_psmct16.sv` (Ch126) — focused contract
for the new `PSMCT16_SWIZZLE` parameter on `gs_pcrtc_stub`.
Mirrors Ch120's wiring shape but for PSMCT16: when the
parameter is 1 AND the active PSM is PSMCT16, scanout reads
go through the Ch125 `gs_swizzle_psmct16_stub` (real PS2 GS
page/block/column layout) instead of the legacy linear
`(FBW*64*y + x)*2` formula. PSMCT32 is gated by its own
`PSMCT32_SWIZZLE` parameter (Ch120); PSMT8/PSMT4 stay linear.
Default 0 keeps every existing PSMCT16 scanout TB
(Ch94/Ch95/Ch103/etc.) on the original linear addressing.
Topology: TB drives `vram_stub.write_*` directly with each
pixel's RGB5A1 halfword preloaded at the swizzled byte address
(TB-side `ref_addr16()` mirrors the swizzle math + the Ch125
source-table-locked tables); pcrtc with `PSMCT16_SWIZZLE=1`
scans out the 16×8 frame and the TB asserts each captured
pixel matches the preloaded color after 5→8 bit-replicate.
Per-pixel pattern is unique per (x, y): R5=`(x^y)&0xF`,
G5=`x&0xF`, B5=`y&0xF`, expanded to 8 bits via PCRTC's
bit-replicate. The PSMCT16 swizzle vs. linear distinction
shows up at any y>0 (linear y=1 → byte 128 with FBW=1, but
swizzled within block (0,0) yb=1 → columnTable16[1][0]=4
→ byte 8) and at x=8, y=0 (linear byte 16 vs swizzled byte 2)
so even within the first row + first block, the gate is a
strict separator. NOTE (now historical): Ch126 was read-side
only when introduced; Ch127 (image-xfer) then Ch128 (raster)
closed the PSMCT16 write sides, mirroring Ch121/Ch122 for
PSMCT32.
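The per-pixel pattern and the 5→8 expansion used by the PSMCT16 scanout check can be sketched in Python. Two things here are assumptions, not statements about the RTL: the RGB5A1 bit layout (R in bits 4:0, G in 9:5, B in 14:10, A in bit 15) and the replicate rule (top three bits copied into the low bits). The function names are hypothetical.

```python
def pattern(x, y):
    """R5/G5/B5 per pixel, as given in the text."""
    return (x ^ y) & 0xF, x & 0xF, y & 0xF

def pack_rgb5a1(r5, g5, b5, a=0):
    """Pack into a 16-bit halfword (assumed layout, see lead-in)."""
    return (r5 & 0x1F) | ((g5 & 0x1F) << 5) | ((b5 & 0x1F) << 10) | ((a & 1) << 15)

def replicate5to8(v5):
    """Expand 5 bits to 8 by copying the top 3 bits into the low bits."""
    return ((v5 << 3) | (v5 >> 2)) & 0xFF

halfwords = {pack_rgb5a1(*pattern(x, y)) for y in range(8) for x in range(16)}
print(len(halfwords))             # 128
print(hex(replicate5to8(0x1F)))   # 0xff
print(hex(replicate5to8(0x0F)))   # 0x7b
```

All 128 pixels of the 16×8 frame get a distinct halfword because G5 encodes x and B5 encodes y, so any address mix-up between two pixels changes the captured color.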
- `tb_gs_swizzle_psmt4.sv` (Ch137) — focused contract for the new
`gs_swizzle_psmt4_stub` math primitive: a pure-comb module mapping
`(FBP, FBW, x, y)` to a VRAM **byte address + nibble_hi selector**
using the real PS2 GS PSMT4 layout (8 KiB pages organized as
128×128 PSMT4 pixels, 2× as many pixels per page as PSMT8 since
each PSMT4 pixel is a NIBBLE; 32 blocks/page in an 8-rows × 4-cols
grid (same orientation as PSMCT16's blockTable16); each block
32×16 pixels = 512 nibbles = 256 bytes; **512-entry within-block
column table** — 2× the entries of PSMT8's 256-entry table due to
the doubled block area, indexed [yb][xb] with yb=0..15 + xb=0..31
→ nibble 0..511). PSMT4 is the most complex of the four common GS
PSMs because each pixel is HALF a byte, so the swizzle outputs
both a byte address and a `nibble_hi` selector (=0 for low
nibble of the byte at `addr`, =1 for high). PSMT4 reuses PSMT8's
page-stride convention (`bw_pg = FBW >> 1`; PCSX2 asserts FBW
must be even at GSLocalMemory.h:560 because PSMT4 pages are 128
px wide). Source-table provenance pinned: `_blockTable4` taken
verbatim from pcsx2/GS/GSTables.cpp lines 61–69; `columnTable4`
from same file lines 147–213. Master HEAD commit
`3000e113e2b3a76357c08dfa80d3c747f40e2706`; file blob SHA
`3581209b8217378f473f9de22a9dbc8c45ca49b6` (same blob Ch131
pinned). Cross-checked against GSLocalMemory.h:558
`BlockNumber4` + the `pxOffset` template at GSTables.cpp:247–258
(blockSize=512 in NIBBLE units, pageSize=16384 nibble units =
8192 bytes, pageWidth=128). The existing per-bit write_mask
0x0F/0xF0 nibble RMW from Ch106/Ch118 will still apply on top
of the swizzled byte address — the swizzle module doesn't touch
the nibble merge logic; it just produces (addr, nibble_hi).
**Five-phase verification** (mirrors Ch125/Ch131 shape, scaled
up): (1) **spot-checks** at 15 hand-computed corners (origin,
intra-block xb=1/8/16/yb=1/yb=2-with-hi-nibble, last nibble of
block (0,0), first/second/third/fourth horizontal blocks,
second-row-of-blocks origin, page-x at x=128 + page-y at y=128,
FBP=4 origin, page0-last-pixel (127,127) → addr 8191 hi=1).
(2a) **INDEPENDENT column-table source lock** — 32 hard-coded
`check_nibble()` calls for yb=0 (literal-by-literal verbatim
from PCSX2 columnTable4 row 0) PLUS a programmatic walk for
yb=1..15 against the in-TB ref function (480 more checks);
Phase 2a's literal yb=0 row + Phase 5's bijectivity sweep +
Phase 3's literal block-table lock together pin the table.
(3) **INDEPENDENT block-table source lock** — 32 hard-coded
checks (one per block in page 0) with expected block index
taken VERBATIM from PCSX2 blockTable4. (4) block-swizzle walk
via in-TB ref_block_idx4. (5) **bijectivity sweep over the
128×128 page** — 16384 NIBBLE slots (vs PSMT8's 8192 byte
slots), every pixel must hit a unique (byte_addr, nibble_hi)
pair and agree with both the in-TB ref byte address AND
ref nibble_hi. Plus multi-page sanity at FBW=4/bw_pg=2
(page-x crossing at x=192 → byte 10496 with blockTable4[1][2]=9,
and page-y crossing at y=128 → byte 16384) and non-page-aligned
FBP coverage at FBP ∈ {1,2,3}, including FBP=3+FBW=4+page-(1,1)
intra-block at (129, 129) → byte 30732 (= 6144 + 3*8192 + 0*256
+ ref_col_idx4(1,1)/2 = 30720 + 12). **First-attempt PASS**
errors=0. NOTE: This module is NOT YET wired into
`gs_pcrtc_stub` / `gif_image_xfer_stub` / `gs_stub` — those
still use linear PSMT4 addressing as of Ch137. The math is
locked here so follow-on chapters can wire `PSMT4_SWIZZLE`
parameter gates into the existing address paths without
disturbing the legacy linear-PSMT4 TBs (Ch103 / Ch106 / Ch107
/ Ch118). With Ch119 PSMCT32 + Ch125 PSMCT16 + Ch131 PSMT8 +
Ch137 PSMT4, **all four common GS PSMs now have byte-accurate-
to-real-PS2 swizzle math available as standalone primitives** —
the four-PSM swizzle math foundation is complete. Future
chapters can wire PSMT4 into pcrtc/image-xfer/raster behind a
PSMT4_SWIZZLE parameter (mirroring Ch120→Ch124 / Ch126→Ch130
/ Ch132→Ch136), with the existing nibble RMW machinery layered
on top.
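A Python sketch of how the PSMT4 byte address and `nibble_hi` compose from page, block, and within-block nibble index, checked against spot values quoted above. Assumptions: `BLOCK_TABLE4` is hand-written and should be verified against PCSX2 (it matches the entries cited in this document); the within-block nibble index `col_idx` is passed in by the caller instead of being looked up in `columnTable4`; `col_idx=24` for (129, 129) is assumed, since the text only fixes `col_idx/2 = 12`.

```python
BLOCK_TABLE4 = [
    [0, 2, 8, 10], [1, 3, 9, 11], [4, 6, 12, 14], [5, 7, 13, 15],
    [16, 18, 24, 26], [17, 19, 25, 27], [20, 22, 28, 30], [21, 23, 29, 31],
]

def psmt4_addr(fbp, fbw, x, y, col_idx):
    """Return (byte_addr, nibble_hi) for pixel (x, y).

    col_idx is the within-block nibble index 0..511 (hypothetical input
    standing in for the columnTable4 lookup).
    """
    assert fbw % 2 == 0, "PSMT4 needs an even FBW"
    bw_pg = fbw >> 1
    page = (y // 128) * bw_pg + (x // 128)
    block = BLOCK_TABLE4[(y % 128) // 16][(x % 128) // 32]
    block_base = fbp * 2048 + page * 8192 + block * 256
    nibble_addr = block_base * 2 + col_idx     # address in nibble units
    return nibble_addr >> 1, nibble_addr & 1

print(psmt4_addr(0, 4, 192, 16, 0))      # (10496, 0)
print(psmt4_addr(0, 2, 127, 127, 511))   # (8191, 1)
print(psmt4_addr(3, 4, 129, 129, 24)[0]) # 30732
```

Working in nibble units and splitting at the end keeps the byte address and the selector derived from one number, so they cannot drift apart.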
- `tb_gs_swizzle_psmt8.sv` (Ch131) — focused contract for the new
`gs_swizzle_psmt8_stub` math primitive: a pure-comb module mapping
`(FBP, FBW, x, y)` to a VRAM byte address using the real PS2 GS
PSMT8 layout (8 KiB pages organized as 128×64 PSMT8 pixels — 2×
wider than CT16's 64×64 page; 32 blocks/page in a 4-rows × 8-cols
grid; each block 16×16 pixels = 256 bytes; **256-entry within-
block column table** — 2× the entries of CT16's 128-entry table
due to the doubled block area, indexed [yb][xb] with yb=0..15 +
xb=0..15 → byte 0..255). PSMT8 also introduces a new page-stride
constant `bw_pg = FBW >> 1` (PCSX2 asserts `(bw & 1) == 0` at
GSLocalMemory.h:553 because PSMT8 pages are 128 px wide vs FBW's
64-px units → 2 FBW units per PSMT8 page, so FBW must be even).
Source-table provenance pinned: `blockTable8` taken verbatim from
pcsx2/GS/GSTables.cpp lines 53–59; `columnTable8` from same file
lines 111–145. Master HEAD commit
`3000e113e2b3a76357c08dfa80d3c747f40e2706`; file blob SHA
`3581209b8217378f473f9de22a9dbc8c45ca49b6`. Cross-checked against
GSLocalMemory.h:551 `BlockNumber8` + the `pxOffset` template at
GSTables.cpp:247–258 (blockSize=256, pageSize=8192, pageWidth=128).
PCSX2's `bp` is in 256-byte block-pointer units; in our
FBP=2048-byte units, `bp = FBP * 8` so `bp*256 = FBP*2048`.
**Five-phase verification** (mirrors Ch125 PSMCT16 shape):
(1) **spot-checks** at 15 hand-computed corners (origin, intra-
block xb=1/4/8/yb=1, last byte of block (0,0), first/second block
origins, second row of blocks, third+fourth blocks, page-x at
x=128 and page-y at y=64, FBP=4 origin); (2a) **INDEPENDENT
column-table source lock** — 256 hard-coded `check()` calls (one
per (yb, xb) inside block (0,0)) where the expected byte index is
taken VERBATIM from PCSX2 columnTable8 with `<literal>` arithmetic,
NOT derived from the in-TB ref function. Catches any case where
DUT and ref share the same miscopy (the same trap Ch125 added
Phase 2a for with PSMCT16's column table); (2b) within-block
16×16 walk via the in-TB ref_col_idx8 (self-check); (3)
**INDEPENDENT block-table source lock** — 32 hard-coded checks
(one per block in page 0) with the expected block index taken
VERBATIM from PCSX2 blockTable8, NOT derived from the in-TB ref;
(4) block-swizzle walk via in-TB ref_block_idx8; (5)
**bijectivity sweep over the 128×64 page** — 8192 byte slots
(vs CT16's 4096 halfword slots), every pixel must hit a unique
byte address in `[0, 8192)` and agree with the in-TB reference.
Plus multi-page sanity at FBW=4/bw_pg=2 (page-x crossing at
x=192 and page-y crossing at y=64) and non-page-aligned FBP
coverage at FBP ∈ {1, 2, 3}, including FBP=3+FBW=4+page-(1,1)
intra-block crossing at (129, 65). **First-attempt PASS**
errors=0. NOTE: This module is NOT YET wired into
`gs_pcrtc_stub` / `gif_image_xfer_stub` / `gs_stub` — those
still use linear PSMT8 addressing as of Ch131. The math is
locked here so follow-on chapters can wire `PSMT8_SWIZZLE`
parameter gates into the existing address paths without
disturbing the legacy linear-PSMT8 TBs (Ch96 / Ch97 / Ch103 /
Ch105 / Ch107 / Ch117). With Ch119 PSMCT32 + Ch125 PSMCT16 +
Ch131 PSMT8, three of the four common GS PSMs now have byte-
accurate-to-real-PS2 swizzle math available as standalone
primitives; PSMT4 (with its 32×16 nibble intra-block layout) was
the natural next candidate and landed in Ch137.
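The page and block geometry quoted across the four swizzle primitives can be cross-checked with a few lines of Python. The dimensions are the ones stated in this document; `derive` is a hypothetical helper, and the "FBW units per page" column assumes FBW counts 64-pixel units as described above.

```python
# name: (bits per pixel, page width, page height, block width, block height)
GEOMETRY = {
    "PSMCT32": (32, 64, 32, 8, 8),
    "PSMCT16": (16, 64, 64, 16, 8),
    "PSMT8":   (8, 128, 64, 16, 16),
    "PSMT4":   (4, 128, 128, 32, 16),
}

def derive(name):
    """Return (page_bytes, block_bytes, blocks_per_page,
    column_table_entries, fbw_units_per_page)."""
    bpp, pw, ph, bw, bh = GEOMETRY[name]
    return (pw * ph * bpp // 8,
            bw * bh * bpp // 8,
            (pw // bw) * (ph // bh),
            bw * bh,
            pw // 64)

for name in GEOMETRY:
    print(name, derive(name))
# PSMCT32 (8192, 256, 32, 64, 1)
# PSMCT16 (8192, 256, 32, 128, 1)
# PSMT8 (8192, 256, 32, 256, 2)
# PSMT4 (8192, 256, 32, 512, 2)
```

Every format lands on 8 KiB pages of 32 blocks with 256 bytes each; only the pixel dimensions and the column-table size change, and the two indexed formats need an even FBW because one page spans two FBW units.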
- `tb_gs_swizzle_psmct16.sv` (Ch125) — focused contract for the
new `gs_swizzle_psmct16_stub` math primitive: a pure-comb module
mapping `(FBP, FBW, x, y)` to a VRAM byte address using the real
PS2 GS PSMCT16 layout (8 KiB pages organized as 64×64 PSMCT16
pixels; 32 blocks/page in a 4×8 grid; each block 16×8 pixels =
256 bytes; **non-trivial within-block column table** — unlike
PSMCT32 where within-block IS row-major halfwords by accident,
PSMCT16 has 4 internal 16×2-pixel sub-columns with a 128-entry
permutation). Source-table provenance pinned: `blockTable16`
taken verbatim from pcsx2/GS/GSTables.cpp lines 29–39
(master HEAD commit 3d71e310; file-touch commit d983b2b0,
2026-01-12); `columnTable16` from same file lines 91–109.
Cross-check against the older Debian-packaged GSdx
`PixelAddressOrg16(x, y, bp, bw) = (BlockNumber16(...) << 7) +
columnTable16[y & 7][x & 15]` confirms the address chain
(`<< 7` lifts to halfword units, multiply by 2 for bytes; in
our FBP=2048-byte units, bp = FBP * 8 so bp*256 = FBP*2048).
**Five-phase verification**: (1) spot-checks at 13 well-defined
corners (origin, intra-block, first/second block, second row of
blocks, page-x and page-y boundaries, FBP=4 origin); (2)
within-block 16×8 walk asserting `byte = 2 * columnTable16[yb][xb]`
— locks the column table; a row-major-halfwords regression would
fail; (3) **source-table lock** — 32 hard-coded address checks
(one per block in page 0) with the expected block index taken
VERBATIM from PCSX2 blockTable16, NOT derived from the in-TB
reference function; (4) block-swizzle walk cross-checking the
in-TB ref function against the DUT (the bijectivity sweep
relies on it being correct); (5) **bijectivity sweep over the
64×64 page** — 4096 halfword slots, every pixel must hit a
unique halfword address in `[0, 8192)` and agree with the in-TB
reference. Plus multi-page sanity at FBW=2 and non-page-aligned
FBP coverage at FBP ∈ {1, 2, 3} (real PS2 supports any
2048-byte-aligned FBP — same broadening Ch119 adopted post-
audit). NOTE: This module is NOT YET wired into `gs_pcrtc_stub`
/ `gif_image_xfer_stub` / `gs_stub` — those still use linear
PSMCT16 addressing as of Ch125. The math is locked here so
follow-on chapters can wire `PSMCT16_SWIZZLE` parameter gates
into the existing address paths without disturbing the legacy
linear-PSMCT16 TBs (Ch94 / Ch95 / Ch103 / Ch116).
- `tb_gs_swizzle_psmct32.sv` (Ch119) — focused contract for the
new `gs_swizzle_psmct32_stub` math primitive: a pure-combinational
module mapping `(FBP, FBW, x, y)` to a VRAM byte address using
the real PS2 GS PSMCT32 page/block swizzle layout (8 KiB pages,
4×8 grid of 8×8-pixel blocks per page, blocks ordered per the
canonical PCSX2 PSMCT32 swizzle table, row-major within a block).
Verification has five phases: (1) spot-checks on the well-defined
corners — origin, intra-block walks, first/second block, second
row of blocks, page-x and page-y boundaries, second page on x,
and FBP=4 origin; (2) within-block 8×8 walk asserting
`byte_in_block = yb*32 + xb*4`; (3) **source-table lock** — 32
hard-coded address checks (one per block in page 0) where the
expected block index is taken VERBATIM from PCSX2's PSMCT32 block
table, NOT derived from the in-TB reference function. This proves
the DUT's `swizzle_psmct32()` table matches the canonical source;
a copied-wrong table that happened to still be a valid permutation
of 0..31 would fail this phase, while the bijectivity sweep below
would pass it; (4) block-swizzle walk (redundant with phase 3,
cross-checks ref_block_idx against the DUT — the bijectivity
sweep relies on ref_block_idx being correct); (5) bijectivity
sweep over the full 64×32 PSMCT32 page — every word slot in
`[0, 8192)` reached exactly once (catches any swap/typo in the
swizzle table). Plus a multi-page sanity check at FBW=2 (pixel
(96, 16) → block (4,2) of page 1 → addr 14336) and a **non-page-
aligned FBP** phase that drives FBP=1, 2, 3 (mid-page in the 8 KiB
sense — real PS2 supports any 2048-byte-aligned FBP; our address
formula is bit-correct for non-page-aligned FBP) plus FBP=3 with
FBW=2 + intra-block crossing as a stress case. NOTE (now
historical): at Ch119 this module was standalone math only;
Ch120 (PCRTC read), Ch121 (image-xfer write), and Ch122
(raster write) wired it into the three integration points —
the same shape that Ch125–Ch128 (PSMCT16), Ch131–Ch134
(PSMT8), and Ch137–Ch140 (PSMT4) followed for the other
three PSMs.
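Why phase 3 (source-table lock) is needed on top of phase 5 (bijectivity) can be shown with a small Python sketch. Assumptions: `CANONICAL` is a hand-written copy of the PSMCT32 block table that should be checked against PCSX2 (its `[2][4]=24` entry agrees with the (96, 16) → 14336 example above), and the within-block term uses the row-major `yb*32 + xb*4` formula stated in phase 2. All function names are hypothetical.

```python
CANONICAL = [
    [0, 1, 4, 5, 16, 17, 20, 21],
    [2, 3, 6, 7, 18, 19, 22, 23],
    [8, 9, 12, 13, 24, 25, 28, 29],
    [10, 11, 14, 15, 26, 27, 30, 31],
]

def is_bijective(table):
    """Bijectivity only asks that every block index 0..31 appears once."""
    return sorted(v for row in table for v in row) == list(range(32))

def source_lock_errors(dut, golden):
    """Source lock compares entry by entry against independent literals."""
    return sum(d != g for dr, gr in zip(dut, golden) for d, g in zip(dr, gr))

def psmct32_addr(fbp, fbw, x, y, table=CANONICAL):
    page = (y // 32) * fbw + (x // 64)
    block = table[(y % 32) // 8][(x % 64) // 8]
    return fbp * 2048 + page * 8192 + block * 256 + (y % 8) * 32 + (x % 8) * 4

# A miscopied table with two entries swapped is still a valid permutation.
miscopied = [row[:] for row in CANONICAL]
miscopied[0][0], miscopied[0][1] = miscopied[0][1], miscopied[0][0]

print(is_bijective(miscopied))                   # True
print(source_lock_errors(miscopied, CANONICAL))  # 2
print(psmct32_addr(0, 2, 96, 16))                # 14336
```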
- `tb_gs_image_xfer_psmt4.sv` (Ch118) — focused contract for
`gif_image_xfer_stub`'s PSMT4 path (the fourth and final
supported PSM). PSMT4 packs 0.5 bytes/pixel (4-bit nibble per
pixel = 2 px/byte), so each 128-bit IMAGE qword carries 32
pixels in 16 bytes. Each emit is a SUB-BYTE write: `write_be
= 4'b0001` with a per-emit nibble mask
(`write_mask = 0x0000_000F` for the LOW nibble,
`0x0000_00F0` for the HIGH nibble), keyed by `(DSAX+x)[0]`;
vram_stub's per-bit merge commits exactly the targeted
nibble, preserving the OTHER nibble of the byte.
Back-to-back emits to the same byte (e.g. x=0 + x=1 of the
same row) chain through NBA semantics without bypass logic
— the same trick the raster channel uses since Ch106. The TB
is INTENTIONALLY adversarial: VRAM is preloaded with `0xA5`
across every byte the engine will write (plus boundary
bytes), then a single IMAGE qword (32 PSMT4 pixels) covers
the entire 8×4 rect. Every byte ends as
`{nibble_high_pixel, nibble_low_pixel}` (no trace of 0xA5);
bytes immediately right of the rect on each row stay 0xA5
(proves no nibble leak past RRW); bytes before / after the
destination region also stay 0xA5. Pattern
`pixel(x,y) = 4'((y*8+x) & 0xF)`. Asserts: 1 trxdir arm, 32
vram writes, every emit `be=0001` and `mask ∈ {0x0F, 0xF0}`,
per-byte readback matches, boundary bytes preserved.
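The sub-byte write described above can be modeled in Python. This is a sketch under assumed semantics: `vram_write` imitates a per-byte enable plus per-bit merge mask the way the text describes `vram_stub`, and `emit_psmt4` (a hypothetical name) uses linear addressing with a 64-pixel row, which is an assumption about the TB's DBW.

```python
def vram_write(mem, addr, data, be, mask):
    """Commit only enabled byte lanes, merging bit-by-bit under mask."""
    for lane in range(4):
        if (be >> lane) & 1:
            m = (mask >> (8 * lane)) & 0xFF
            d = (data >> (8 * lane)) & 0xFF
            mem[addr + lane] = (mem[addr + lane] & ~m & 0xFF) | (d & m)

def emit_psmt4(mem, dest_base, x, y, nibble, row_px=64):
    """One PSMT4 pixel: be=0001, mask 0x0F (even x) or 0xF0 (odd x)."""
    byte_addr = dest_base + (y * row_px + x) // 2
    if x & 1:
        vram_write(mem, byte_addr, (nibble & 0xF) << 4, 0b0001, 0x0000_00F0)
    else:
        vram_write(mem, byte_addr, nibble & 0xF, 0b0001, 0x0000_000F)

BASE = 16
mem = bytearray([0xA5] * 256)            # adversarial preload
for y in range(4):
    for x in range(8):
        emit_psmt4(mem, BASE, x, y, (y * 8 + x) & 0xF)

print([hex(b) for b in mem[BASE:BASE + 5]])
# ['0x10', '0x32', '0x54', '0x76', '0xa5']
```

The fifth byte keeps its 0xA5 preload, which is the "no nibble leak past RRW" property the TB asserts.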
- `tb_gs_image_xfer_psmt8.sv` (Ch117) — focused contract for
`gif_image_xfer_stub`'s PSMT8 path. Pushes 2 IMAGE qwords
(32 PSMT8 pixels = 16 px/qword × 2) through the engine after
a TRXDIR-shaped GIF-A+D register sequence with DPSM=PSMT8
(=0x13). PSMT8 packs 1 byte/pixel (an 8-bit CLUT index), so
each qword holds 16 pixels; the engine emits one 8-bit pixel
per cycle with `write_be = 4'b0001`, the index in the LOW
byte of `write_data`, and `write_mask = 0xFFFFFFFF`;
vram_stub commits `mem[write_addr] <= write_data[7:0]` at
any byte alignment. Pattern is `pixel(x,y) = 8'(y*16 + x)`, giving
32 distinct values across the 8×4 rect so a wrong-byte-lane
commit shows up unambiguously. Asserts: 1 trxdir arm, 32
vram writes (all `be=0001`, `mask=0xFFFFFFFF`), every pixel
reads back at `dest_base + y*64 + x`, plus right-of-rect /
before / after byte-zero boundary preservation. Each qword
packs TWO rows of 8 pixels (lanes 0..7 = row y, lanes 8..15
= row y+1) — exercises the per-lane row-stride math at the
qword boundary.
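A Python sketch of the per-lane row-stride behavior: each 128-bit qword is split into 16 byte lanes and the cursor wraps to the next row after RRW pixels. The lane order (lane 0 in bits 7:0) and the helper names are assumptions for illustration.

```python
def unpack_psmt8_qword(qword):
    """Split a 128-bit value into 16 byte lanes, lane 0 = bits [7:0]."""
    return [(qword >> (8 * lane)) & 0xFF for lane in range(16)]

def upload_psmt8(mem, dest_base, qwords, rrw=8, row_stride=64):
    """Write pixels linearly, wrapping x at rrw; returns the write count."""
    x = y = writes = 0
    for q in qwords:
        for px in unpack_psmt8_qword(q):
            mem[dest_base + y * row_stride + x] = px
            writes += 1
            x += 1
            if x == rrw:          # lanes 8..15 continue on row y+1
                x, y = 0, y + 1
    return writes

# Pattern pixel(x, y) = y*16 + x over an 8x4 rect, packed row-major.
stream = [(y * 16 + x) & 0xFF for y in range(4) for x in range(8)]
qwords = [int.from_bytes(bytes(stream[i:i + 16]), "little") for i in (0, 16)]

DEST = 32
mem = bytearray(512)
writes = upload_psmt8(mem, DEST, qwords)
print(writes)                    # 32
print(mem[DEST + 1 * 64 + 0])    # 16
```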
- `tb_gs_image_xfer_psmct16.sv` (Ch116) — focused contract for
`gif_image_xfer_stub`'s new PSMCT16 path. Pushes 4 IMAGE
qwords (32 PSMCT16 pixels = 8 px/qword × 4) through the
engine after a TRXDIR-shaped GIF-A+D register sequence
(BITBLTBUF/TRXPOS/TRXREG/TRXDIR). PSMCT16 packs 2 bytes/pixel,
so each qword holds 8 pixels (vs 4 for PSMCT32). The engine
emits one 16-bit pixel per cycle to vram_stub with
`write_be = 4'b0011`, the pixel value in the LOW halfword of
`write_data`, and `write_mask = 0xFFFFFFFF`; vram_stub commits
the 2 bytes at the 2-byte-aligned destination address. Pattern
is `pixel(x,y) = 16'h{yyxx}{yyxx}` — distinct per-pixel value
so a wrong-lane commit shows up unambiguously. Asserts:
1 trxdir arm, 32 vram writes (all `be=0011`, `mask=0xFFFFFFFF`),
every pixel reads back at `dest_base + y*row_stride + x*2`,
and the bytes immediately right of the rect on each row +
before the dest region + after the dest region all stay zero
(proves row-stride math + no halfword leak past RRW). PSMT8
image-xfer landed in Ch117 and PSMT4 image-xfer landed in
Ch118 — see those TB rows for their own per-byte / per-nibble
contract coverage.
- `tb_gs_demo_psmt4_e2e_trxdir.sv` (Ch110) — driver-shaped
PSMT4 demo with the palette upload now arriving via a real
TRXDIR/TRXPOS/TRXREG/HWREG image-transfer GIF packet sequence
instead of TB-direct vram_stub writes. Closes the LAST
TB-direct path in the e2e demo flow: every byte the GS sees —
framebuffer pixels AND palette source — now arrives through a
driver-shaped GIF stream. The DMAC delivers 36 qwords total:
U1 (PACKED, NREG=4): BITBLTBUF/TRXPOS/TRXREG/TRXDIR — TRXDIR
arms `gif_image_xfer_stub`. U2 (IMAGE, NLOOP=4): 4 qwords of 4
PSMCT32 entries each → 16 palette entries written into VRAM at
DBP*256 by `gif_image_xfer_stub`. Then 4 SPRITE PACKED packets
+ 1 TEX0_1 PACKED packet. PASS criteria add to Ch109's:
**1 EV_DMA_START / 36 EV_DMA_BEAT / 1 EV_DMA_DONE**, **7
GIFtag accepts** (U1 + U2 + 4×SPRITE + TEX0), **25 PACKED A+D
dispatches** (4 TRX-setup + 20 SPRITE + 1 TEX0), **16
image-xfer VRAM writes** from `gif_image_xfer_stub` (DBP=4,
DBW=1, DPSM=PSMCT32, DSAX=DSAY=0, RRW=16, RRH=1). The vram_stub
write port is muxed at TB level: `xfer_busy ? xfer_we :
raster_pixel_emit` (sequenced — palette upload completes before
sprites raster). Ch110 also added a backpressure path on
`gif_packed_stub` (`image_data_ready` input) so the upstream
DMA stalls while `gif_image_xfer_stub` is draining the previous
IMAGE qword's 4 PSMCT32 lanes; outside S_IMAGE the gate is a
no-op (in_ready stays high). Privileged-block MMIO (PMODE/
DISPFB1/DISPLAY1) remains TB-direct because those are CPU MMIO
writes in real hardware, not GIF traffic.
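The Ch110 event counts follow from the packet list; a short Python tally makes the bookkeeping explicit. The tuple layout is a hypothetical representation chosen for this sketch, and the counts are the ones stated above.

```python
# (label, GIFtag format, payload qwords after the tag)
packets = (
    [("U1 TRX setup", "PACKED", 4),      # BITBLTBUF/TRXPOS/TRXREG/TRXDIR
     ("U2 palette", "IMAGE", 4)]         # 4 qwords of 4 PSMCT32 entries
    + [("SPRITE", "PACKED", 5)] * 4      # PRIM, FRAME_1, RGBAQ, XYZ2, XYZ2
    + [("TEX0_1", "PACKED", 1)]
)

giftag_accepts = len(packets)
ad_dispatches = sum(n for _, fmt, n in packets if fmt == "PACKED")
dma_beats = sum(1 + n for _, _, n in packets)          # tag + payload
image_writes = sum(n * 4 for _, fmt, n in packets if fmt == "IMAGE")

print(giftag_accepts, ad_dispatches, dma_beats, image_writes)
# 7 25 36 16
```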
- `tb_gs_demo_psmt4_e2e_dmac.sv` (Ch109) — same 4-quadrant
PSMT4 demo as Ch108, but the GIFtag + PACKED A+D quadwords
arrive at `gif_packed_stub` via the DMAC channel-2 →
`ee_memory_map_stub` → `ee_ram_stub` path instead of being
TB-driven directly. Closes the last GIF-side sideband from
Ch108: the demo is now reachable the way real EE/IOP code
reaches it. The TB pre-stages the same 26 qwords (4 SPRITE
packets × 6 qwords + 1 TEX0_1 packet × 2 qwords) into RAM at
PAYLOAD_MADR, then writes DMAC channel-2 MADR/QWC/CHCR; a
single NORMAL transfer with QWC=26 streams them into the GIF.
PASS criteria add to Ch108's: **1 EV_DMA_START / 26
EV_DMA_BEAT / 1 EV_DMA_DONE** (DMA event taxonomy locked),
with the same downstream chain — 5 GIFtag accepts, 21 A+D
dispatches in the expected reg-num sequence, 32 PSMT4 emits,
1 loader_busy rise, identical 16×8 captured frame. Privileged-
block MMIO and palette pre-stage stay TB-direct (NOT GIF-side);
TRXDIR/HWREG image-transfer for palette upload is a separate
future chapter.
- `tb_gs_demo_psmt4_e2e_packed.sv` (Ch108) — same 4-quadrant
PSMT4 demo as Ch107 but routed through the GIFtag / PACKED
A+D front-end (`gif_packed_stub` with REAL_AD_REG_MAP=1).
Closes the last bit of GS-side sideband from Ch107: instead
of TB-driving `gs_stub.gif_reg_*` directly, the TB pushes raw
128-bit GIFtag + PACKED A+D quadwords into `gif_packed_stub.
in_*` exactly the way the real GIF would receive them from
PATH3. Each SPRITE is a packet of 1 GIFtag (NLOOP=1, NREG=5,
PACKED, REGS=0xEEEEE — 5×A+D in the low 5 nibble slots) +
5 PACKED A+D qwords (PRIM, FRAME_1=PSMT4, RGBAQ, XYZ2, XYZ2);
TEX0_1 load is its own 1-tag/1-A+D packet. Total: 5 GIFtag
accepts (4 SPRITEs + 1 TEX0_1) and 4×5 + 1×1 = 21 PACKED A+D
register-write dispatches into gs_stub.gif_reg_*. 32 PSMT4
raster emits arrive (Ch106 RMW), loader fires exactly once
on TEX0_1, and the captured 16×8 frame matches the same
expected CLUT-decoded RGB as Ch107 — i.e. real-format GIF
packets reach the GS register file with the same cadence the
TB previously synthesised by hand. Privileged-block MMIO
(PMODE/DISPFB1/DISPLAY1) and the palette pre-stage in VRAM
remain TB-direct because they are NOT GIF-side; the palette
upload via real-PS2 TRXDIR/TRXPOS/TRXREG/HWREG image-transfer
packets is a separate future chapter, as is the DMAC channel-2
burst that would normally deliver the GIFtag qwords (this TB
drives `gif_packed_stub.in_*` directly to keep the demo
narrow and deterministic; the full DMAC→RAM→GIF round trip
is what the integration-tier `tb_ee_core_gif_*` family
covers).
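The SPRITE GIFtag described above can be assembled and decoded with a small Python sketch. The field positions follow the commonly documented PS2 GIFtag layout (NLOOP 14:0, EOP 15, PRE 46, PRIM 57:47, FLG 59:58, NREG 63:60, REGS 127:64); treat them as an assumption to verify against the RTL. The EOP value the TB uses is not stated here, so `eop` is left as a parameter, and both function names are hypothetical.

```python
FLG_PACKED = 0
REG_AD = 0xE          # A+D register descriptor

def make_giftag(nloop, nreg, regs, flg=FLG_PACKED, eop=1, pre=0, prim=0):
    """Build a 128-bit GIFtag as a Python int (assumed field layout)."""
    tag = nloop & 0x7FFF
    tag |= (eop & 1) << 15
    tag |= (pre & 1) << 46
    tag |= (prim & 0x7FF) << 47
    tag |= (flg & 0x3) << 58
    tag |= (nreg & 0xF) << 60
    tag |= (regs & 0xFFFF_FFFF_FFFF_FFFF) << 64
    return tag

def decode_regs(tag):
    """Return the NREG register descriptors, lowest nibble first."""
    nreg = (tag >> 60) & 0xF
    regs = tag >> 64
    return [(regs >> (4 * i)) & 0xF for i in range(nreg)]

sprite_tag = make_giftag(nloop=1, nreg=5, regs=0xEEEEE)
print(decode_regs(sprite_tag))   # [14, 14, 14, 14, 14]
```

With NLOOP=1 and NREG=5 the tag announces five A+D qwords, which is where the 4×5 + 1 = 21 dispatch count comes from.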
- `tb_gs_psmt4_round_trip.sv` (Ch104) — full driver-shaped
PSMT4 + CLD=4 + CSA round trip. Wires `gs_stub` +
`vram_stub` + `clut_stub` + `clut_loader_stub` + `gs_pcrtc_stub`
end-to-end with `pcrtc.clut_csa = gs_stub.tex0_1_csa_q` (the
Ch98 sideband-free pattern). Phase 1: stages a 4×4 PSMT4 sprite
in VRAM, plus a 16-entry pattern_a palette in VRAM at
`CBP_A*256`. Drives TEX0_1 with `CBP=4, CPSM=PSMCT32, CSM=CSM2,
CSA=0, CLD=4`; the loader writes pattern_a into `clut_stub[0..15]`
and `pcrtc.clut_csa` is 0, so PSMT4 scanout reads pattern_a per
nibble. Phase 2: stages a different pattern_b palette at
`CBP_B*256` and drives TEX0_1 with `CBP=8, CSA=4, CLD=4`; the
loader writes pattern_b into `clut_stub[64..79]` (the CSA=4
window) and `pcrtc.clut_csa` flips to 4, so the same VRAM
sprite — same DISPFB1 / DISPLAY1 / PMODE — now reads pattern_b.
Proves loader policy + clut_stub contents + PCRTC lookup are
wired consistently.
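The two-phase CSA behavior reduces to one index formula, `CSA*16 + nibble`. A minimal Python sketch, with hypothetical helper names and made-up palette values standing in for pattern_a and pattern_b:

```python
def load_cld4(clut, palette16, csa):
    """CLD=4 style load: write only the 16-entry window selected by CSA."""
    for i, entry in enumerate(palette16):
        clut[csa * 16 + i] = entry

def lookup(clut, csa, nibble):
    """PSMT4 scanout lookup through the CSA window."""
    return clut[csa * 16 + (nibble & 0xF)]

clut = [0] * 256
pattern_a = [0xA0 + i for i in range(16)]   # placeholder palette values
pattern_b = [0xB0 + i for i in range(16)]

load_cld4(clut, pattern_a, csa=0)           # phase 1 -> clut[0..15]
load_cld4(clut, pattern_b, csa=4)           # phase 2 -> clut[64..79]

print(hex(lookup(clut, 0, 5)))   # 0xa5
print(hex(lookup(clut, 4, 5)))   # 0xb5
```

The same nibble resolves to a different entry once CSA changes, and the phase 1 window is left intact by the phase 2 load.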
Scope (current, after Ch165):
- **PSMCT32 (DISPFB1.PSM=0), PSMCT16 (PSM=2), PSMT8 (PSM=0x13),
and PSMT4 (PSM=0x14) honored at BOTH the read and write
sides** (Ch94 + Ch95 + Ch96 + Ch97 + Ch103 + Ch105 + Ch106).
PSMCT24/PSMCT16S/PSMZ32/etc. force scanout off and are not
contract-tested at the raster channel. The write side
(gs_stub.raster_pixel_emit) emits the four supported PSMs via
`raster_pixel_be_q` (per-byte gate) and `raster_pixel_mask_q`
(per-bit merge mask, Ch106): PSMCT32 = be `0xF` / mask
`0xFFFFFFFF`, PSMCT16 = be `0x3` / mask `0xFFFFFFFF`, PSMT8 =
be `0x1` / mask `0xFFFFFFFF`, PSMT4 = be `0x1` / mask `0x0F`
or `0xF0`. The mask path is no-op for byte-or-larger PSMs
(mem[i] = data[i] when mask_i = 0xFF) and only meaningful for
PSMT4 sub-byte writes. PSMT8 / PSMT4 scanout surfaces the
index/nibble as grayscale by default; with `clut_enable=1`
(Ch97/Ch103) and a programmed `clut_stub`, the index/nibble looks
up real RGB. CLUT contents come either from a TB-direct write OR
(Ch99..Ch102) from a VRAM→CLUT load triggered by a TEX0_1 GIF
write with `CSM == 1` (CSM2 linear), `CPSM` ∈ {PSMCT32,
PSMCT16}, and a CLD value passing the policy: CLD=0 never;
CLD=1 always (full 256-entry load); CLD=2 if CBP changed since
last load (full); CLD=3 if CBP/CPSM/CSA any-changed (full);
CLD=4 always but only the 16-entry CSA window at indices
`CSA*16 + i` (Ch102 — preserves the other 240 entries);
CLD ∈ {5..7} silently no-op (reserved). `clut_loader_stub`
walks the entries via `vram_stub`'s second read port; PSMCT16
entries are unpacked with the same 5→8 bit-replicate the
scanout side uses (Ch94). CSM1 swizzle and CPSM ∉ {PSMCT32,
PSMCT16} remain deferred.
- **Single CRTC, single DISPFB**. The real GS PCRTC has two
read circuits (DISPFB1/DISPLAY1 and DISPFB2/DISPLAY2). One read
circuit is enough for TBs to verify the round trip; PMODE.EN2 +
DISPFB2 + DISPLAY2 is deferred.
- **Read-side addressing**. Linear by default (legacy formula
`vram_read_addr = FBP*2048 + (effective_y*FBW*64 + effective_x)
<< bpp_shift`). Four OPTIONAL per-PSM swizzle paths gated by
parameters on `gs_pcrtc_stub`: `PSMCT32_SWIZZLE=1` (Ch120)
routes PSMCT32 reads through `gs_swizzle_psmct32_stub`;
`PSMCT16_SWIZZLE=1` (Ch126) routes PSMCT16 reads through
`gs_swizzle_psmct16_stub`; `PSMT8_SWIZZLE=1` (Ch132) routes
PSMT8 reads through `gs_swizzle_psmt8_stub` (Ch131) — FBW must
be even because PSMT8 pages are 128 px wide and the swizzle
internally divides FBW by 2; `PSMT4_SWIZZLE=1` (Ch138) routes
PSMT4 reads through `gs_swizzle_psmt4_stub` (Ch137); FBW must
be even (same as PSMT8). The four parameters are independent —
enabling one doesn't affect the others. PSMT4's swizzle module
also outputs a `nibble_hi` selector that PCRTC uses in place of
`pixel_index[0]` to pick which nibble of the byte at the
swizzled address holds this pixel (PSMT4 packs 2 pixels per
byte and the canonical PCSX2 column table reorders nibbles
within a block, so the linear formula's `pixel_index[0]`
selector is no longer correct under the swizzled layout). All
four swizzle parameter defaults are 0 so all existing PCRTC-
using TBs see the legacy linear behavior unchanged. The
PSMT4 image-xfer (Ch139) and raster (Ch140) write-side
wiring is now live as well, completing the four-PSM × three-
path swizzle integration. Both driver-shape e2e demos for
PSMT4 are also live: raster-driven (Ch141) and TRXDIR-driven
(Ch142). All four common GS PSMs now have BOTH driver-shape
e2e demos (CT32 Ch123+Ch124, CT16 Ch129+Ch130, T8 Ch135+
Ch136, T4 Ch141+Ch142) — closing the four-PSM × three-path
× dual-driver-shape e2e foundation.
- **Parallel to `platform_video_stub`, not a replacement**. We
did not extend `platform_video_stub` (which would have
rippled through 6 existing TBs). Pcrtc is the alternative
video source for TBs that want VRAM-backed scanout. The legacy
flood-fill module stays as-is.
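The CLD load policy listed in the scope above can be written down as a small decision function. This is a sketch of the policy as the text states it, not a transcription of `clut_loader_stub`; the function name and the three `*_changed` flags are hypothetical, and the sketch assumes the `CSM == 1` and `CPSM` ∈ {PSMCT32, PSMCT16} preconditions already hold.

```python
# Sketch of the CLD load policy from the scope list (not the RTL).
from typing import Optional


def clut_load_range(cld: int, csa: int, cbp_changed: bool,
                    cpsm_changed: bool, csa_changed: bool) -> Optional[range]:
    """Return the CLUT indices a TEX0_1 write would load, or None for no load."""
    full = range(0, 256)
    if cld == 0:
        return None                              # never load
    if cld == 1:
        return full                              # always, full 256-entry load
    if cld == 2:
        return full if cbp_changed else None     # only when CBP changed
    if cld == 3:
        changed = cbp_changed or cpsm_changed or csa_changed
        return full if changed else None         # any of CBP/CPSM/CSA changed
    if cld == 4:
        return range(csa * 16, csa * 16 + 16)    # 16-entry CSA window only
    return None                                  # CLD 5..7: reserved, silent no-op
```

For example, `clut_load_range(4, 4, False, False, False)` covers only entries 64..79, leaving the other 240 entries untouched.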
### End-to-end demo manifest (Ch143)
Eight driver-shaped end-to-end byte-accurate demos cover the
four common GS PSMs across both driver shapes (raster-driven
PACKED-SPRITE payload + TRXDIR-driven IMAGE payload). Each demo
runs the same EE-bootlet → DMAC → GIF → GS → vram → swizzled-
PCRTC chain with all three same-PSM swizzle gates parameter-set
to 1; the listed write-side path is load-bearing and the other
write-side path is asserted dormant in the demo flow.
All eight demos emit a 16×8 framebuffer (128 pixels). The raster
column shows `(emits, xfer_writes)`; the TRXDIR column shows
`(xfer_writes, emits)` — in both cases the load-bearing path
fires 128 times and the dormant path is asserted 0.
| PSM | Raster-driven e2e | TRXDIR-driven e2e |
|---------|---------------------------------|------------------------------------|
| PSMCT32 | Ch123 — `tb_gs_demo_psmct32_swizzle_e2e` (128, 0) | Ch124 — `tb_gs_demo_psmct32_swizzle_trxdir_e2e` (128, 0) |
| PSMCT16 | Ch129 — `tb_gs_demo_psmct16_swizzle_e2e` (128, 0) | Ch130 — `tb_gs_demo_psmct16_swizzle_trxdir_e2e` (128, 0) |
| PSMT8 | Ch135 — `tb_gs_demo_psmt8_swizzle_e2e` (128, 0) | Ch136 — `tb_gs_demo_psmt8_swizzle_trxdir_e2e` (128, 0) |
| PSMT4 | Ch141 — `tb_gs_demo_psmt4_swizzle_e2e` (128, 0) | Ch142 — `tb_gs_demo_psmt4_swizzle_trxdir_e2e` (128, 0) |
For each row both demos use the same per-quadrant pixel pattern
(so the verify side is shared across the row), the same DBW-
even constraint where applicable (PSMT8 / PSMT4: 128-px-wide
pages → DBW=2 minimum even), and verification through the
freed-up `vram_stub` 2nd read port. Ch141 + Ch142 together
close the four-PSM × three-path × dual-driver-shape e2e
foundation — the foundation Ch143 manifests and seals.
**Hardware-demo candidates**:
- **PSMCT32 swizzled raster e2e (Ch123)** — simplest direct-
color path: 4 SPRITE PACKED packets, RGBAQ.{R,G,B,A} mapped
1:1 to scanout RGB, no CLUT, no nibble RMW. The natural first
hardware demo because every byte from EE-bootlet through the
swizzled 16×8 framebuffer to PCRTC RGB is visible without
any indirection. Build target: `make tb_gs_demo_psmct32_swizzle_e2e`.
- **PSMT4 swizzled TRXDIR e2e (Ch142)** — strongest indexed/
CLUT-like stress path: U1 PACKED A+D TRX setup + U2 IMAGE
NLOOP=4 with 32 PSMT4 nibbles per qword, image-xfer engine
decoding the canonical PCSX2 columnTable4 (which reorders
nibbles within a block — the linear `pixel_index[0]` rule is
wrong under swizzle), and per-pixel nibble RMW on vram_stub
via `write_be=4'b0001 + write_mask ∈ {0x0F, 0xF0}` keyed by
the swizzle's `nibble_hi`. Exercises the full sub-byte
pipeline + the canonical-source-locked column table. Build
target: `make tb_gs_demo_psmt4_swizzle_trxdir_e2e`.
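The `NLOOP=4` figure for the PSMT4 IMAGE payload follows directly from the framebuffer size. The sketch below only does that arithmetic; the helper names are hypothetical, and the qword counts it yields for the other three PSMs are derived values, not NLOOP settings confirmed by the text.

```python
# Bits per pixel for the four PSMs covered by the manifest.
BITS_PER_PIXEL = {"PSMCT32": 32, "PSMCT16": 16, "PSMT8": 8, "PSMT4": 4}
QWORD_BITS = 128


def pixels_per_qword(psm: str) -> int:
    """How many pixels of the given PSM fit in one 128-bit qword."""
    return QWORD_BITS // BITS_PER_PIXEL[psm]


def image_qwords(width: int, height: int, psm: str) -> int:
    """Number of 128-bit qwords needed to carry a width x height image."""
    total_bits = width * height * BITS_PER_PIXEL[psm]
    return -(-total_bits // QWORD_BITS)   # ceiling division


# The 16x8 demo framebuffer in PSMT4.
psmt4_qwords = image_qwords(16, 8, "PSMT4")
```

For the 16×8 PSMT4 case this gives 32 nibbles per qword and 4 qwords, matching the U2 IMAGE packet described above.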
### First hardware-targeted top wrapper (Ch146)
Ch146 turns the Ch144 readiness audit + Ch145 BRAM-shrink groundwork
into a real top-level SystemVerilog module: [`rtl/top/top_psmct32_raster_demo.sv`](../../rtl/top/top_psmct32_raster_demo.sv).
This is the module a board-level synthesis project would target
first. Board-level concerns (HDMI/VGA PHY, pin constraints, .mem
bake tooling, clock-domain crossings) are deliberately deferred —
Ch146 proves the design can be expressed as a single hardware-
shape module.
**Top ports**:
- `clk` / `rst_n` / `core_go` — clock, active-low synchronous reset,
start pulse (a board reset-release sequencer can tie `core_go`
high after `rst_n` deasserts).
- `r/g/b/hsync/vsync/de` — 8-bit RGB scanout from PCRTC.
- `core_halt` / `dma_done_seen` / `frame_seen` — debug/status bundle
suitable for LEDs or a board-level state observer.
**Top parameters**: `H_ACTIVE` (default 16), `V_ACTIVE` (default 8),
`BIOS_SIZE_BYTES`, `RAM_SIZE_BYTES`, `VRAM_BYTES`,
`USEG_SHADOW_WORDS_PARAM` (default 1024 = 4 KiB per Ch145).
**Image fixtures** are passed via macros (iverilog-12 string-
parameter forwarding limitation):
`TOP_PSMCT32_RASTER_DEMO_BIOS_IMAGE_FILE` and
`TOP_PSMCT32_RASTER_DEMO_PAYLOAD_IMAGE_FILE`. The fixtures are
baked by [`sim/data/top_psmct32_raster_demo/bake.py`](../../sim/data/top_psmct32_raster_demo/bake.py)
which writes:
- `bios.mem` — 18-word EE bootlet (one 32-bit hex word per line)
- `payload.mem` — 40 qwords for ee_ram_stub (16 zero qwords +
24 GIF qwords carrying 4 SPRITE PACKED packets)
The bake script is a deterministic Python rewrite of the
procedural `ee_prog_word()` + `preload_qword()` loops in the
Ch123 TB. Same bit-exact values, just baked into static repo
artifacts so a hardware top can `$readmemh` them.
**Focused TB**: [`sim/tb/top/tb_top_psmct32_raster_demo.sv`](../../sim/tb/top/tb_top_psmct32_raster_demo.sv).
Drives the top with the static fixtures, captures one full
PCRTC frame after the EE halts and DMAC completes, and asserts
the per-quadrant RGB matches the Ch123 frame exactly. Counts:
`raster_emits=128, errors=0, core_halt=1, dma_done_seen=1,
frame_seen=1`.
**Bug-fix iteration**: the first bake had Y in XYZ2 placed at
bits[43:32] instead of bits[31:20] — a Python translation error
of the SystemVerilog `{32'd0, y_int, 4'd0, x_int, 4'd0}`
concatenation. Symptom: per-sprite emit count was 8 instead of
32 (each sprite drew one row), and VRAM held the per-sprite R
component scattered across 32 consecutive 4-byte cells. Caught
by adding a per-emit observer that printed
`(addr, data, be, mask, color_q)` for the first 10 emits.
Fix: `y << 20` instead of `y << 32` in `bake.py`. **PASS after
the fix.**
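The bit placement at the heart of this bug is easy to check in isolation. The sketch below mirrors the SystemVerilog concatenation in Python, assuming 12-bit `x_int` / `y_int` fields (the width that makes the concatenation 64 bits wide); the function names are hypothetical and this is not the actual `bake.py` code.

```python
def pack_xyz2(x_int: int, y_int: int) -> int:
    """Mirror of {32'd0, y_int, 4'd0, x_int, 4'd0} with 12-bit x_int / y_int."""
    if not (0 <= x_int < 1 << 12 and 0 <= y_int < 1 << 12):
        raise ValueError("x_int and y_int must fit in 12 bits")
    return (y_int << 20) | (x_int << 4)


def pack_xyz2_buggy(x_int: int, y_int: int) -> int:
    """The first-bake mistake: Y shifted by 32 instead of 20."""
    return (y_int << 32) | (x_int << 4)


def field(value: int, hi: int, lo: int) -> int:
    """Extract bits [hi:lo] of value."""
    return (value >> lo) & ((1 << (hi - lo + 1)) - 1)


good = pack_xyz2(8, 4)
bad = pack_xyz2_buggy(8, 4)
```

In `good`, Y sits in bits [31:20]; in `bad`, those bits are zero and Y has leaked into bits [43:32], so the rasterizer sees a Y coordinate of 0.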
**What's still NOT in this chapter** (deferred to Ch147+):
- Real `.mem` bake tooling integration (currently the
`bake.py` is run manually before sim; a Makefile target or
CI step that invokes it would belong in Ch147).
- Board-specific top: pin constraints, target FPGA family,
PHY shim (HDMI/DVI/VGA), reset-release sequencer.
- A multi-PSM top (the Ch142 PSMT4 TRXDIR variant would be a
natural second wrapper once the build flow is proven).
### Fixture bake flow (Ch147)
Ch147 makes the Ch146 `.mem` bake first-class so the static
fixtures can't drift from `bake.py`. Three new Makefile targets:
| Target | Purpose |
|-----------------------------------------|-----------------------------------------------------------------------|
| `top_psmct32_raster_demo_mem` | Re-runs `bake.py`; produces `bios.mem` + `payload.mem` atomically. |
| `top_psmct32_raster_demo_mem_check` | Verifies fixture sizes (bios.mem = 1024 lines, payload.mem = 256). |
| `tb_top_psmct32_raster_demo` (existing) | Now declares `top_psmct32_raster_demo_mem` as a prerequisite. |
The bake target uses Make's grouped-target syntax (`&:`) so a
single `bake.py` run produces both files atomically — they can
never be out-of-step.
The size-check target counts the data lines in each fixture
(skipping blanks + `// ...` comment-only lines) and asserts the
exact expected counts. A non-matching count exits with status 1,
surfacing a fixture/script drift as a hard build failure.
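The counting rule is simple enough to restate in Python. This is an illustrative equivalent only: the real check lives in the Makefile target, and `count_mem_entries` / `check_fixture` are hypothetical names.

```python
def count_mem_entries(text: str) -> int:
    """Count data lines, skipping blank lines and //-comment-only lines."""
    count = 0
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("//"):
            continue
        count += 1
    return count


def check_fixture(text: str, expected: int) -> int:
    """Return a shell-style exit status: 0 on match, 1 on drift."""
    return 0 if count_mem_entries(text) == expected else 1
```

A caller would pass the contents of `bios.mem` with `expected=1024` and the contents of `payload.mem` with `expected=256`.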
Deleting the fixtures and running the TB triggers the bake
automatically:
```
$ make tb_top_psmct32_raster_demo
=== bake top_psmct32_raster_demo .mem fixtures ===
python3 .../bake.py
[bake] wrote bios.mem (1024 words, 18 active) and payload.mem (256 qwords, 40 active)
=== build tb_top_psmct32_raster_demo ===
...
[tb_top_psmct32_raster_demo] PASS
```
#### Synthesis-facing macros
When pointing a synthesis tool at `rtl/top/top_psmct32_raster_demo.sv`,
two preprocessor defines must be set so `bios_rom_stub` and
`ee_ram_stub` find their `$readmemh` images. These are macros
(NOT module parameters) per the iverilog-12 string-parameter
forwarding workaround documented in the Ch146 wrapper banner;
they map cleanly to FPGA-tool defines.
| Macro | Value |
|----------------------------------------------------|----------------------------------------------------------------|
| `TOP_PSMCT32_RASTER_DEMO_BIOS_IMAGE_FILE` | Absolute (or tool-relative) path to `bios.mem` |
| `TOP_PSMCT32_RASTER_DEMO_PAYLOAD_IMAGE_FILE` | Absolute (or tool-relative) path to `payload.mem` |
Both default to `""` so the wrapper still elaborates without
fixtures (synthetic NOP-sled in `bios_rom_stub` + zero-init
`ee_ram_stub`, which produces no DMAC payload but a stable
PCRTC frame with `r=g=b=0`).
**Vivado** (preprocessor `verilog_define` property on the source
fileset; these are macros, not module generics):
```
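# Use [list ...] with double quotes so Tcl substitutes $path;
# inside braces, $path would be passed through literally.
set_property verilog_define [list \
    "TOP_PSMCT32_RASTER_DEMO_BIOS_IMAGE_FILE=\"$path/bios.mem\"" \
    "TOP_PSMCT32_RASTER_DEMO_PAYLOAD_IMAGE_FILE=\"$path/payload.mem\"" \
] [get_filesets sources_1]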
```
Repeat for the simulation fileset (`sim_1`) if the design is also
simulated from the Vivado project.
**Quartus** (project-level macro defines):
```
set_global_assignment -name VERILOG_MACRO \
"TOP_PSMCT32_RASTER_DEMO_BIOS_IMAGE_FILE=\"$path/bios.mem\""
set_global_assignment -name VERILOG_MACRO \
"TOP_PSMCT32_RASTER_DEMO_PAYLOAD_IMAGE_FILE=\"$path/payload.mem\""
```
**Iverilog (sim)**: the Ch147 Makefile passes them via `-D`
flags in the `tb_top_psmct32_raster_demo` build rule —
`-DTOP_PSMCT32_RASTER_DEMO_BIOS_IMAGE_FILE='"$(SIM_DIR)/data/...
/bios.mem"'` — and the `top_psmct32_raster_demo_mem`
prerequisite ensures the .mem files exist before the TB
elaborates.
### DE25-Nano synthesis scaffold (Ch148)
Ch148 makes the Ch146 hardware top synthesis-addressable on
DE25-Nano without committing to a video PHY shim or final pin
constraints (those land in Ch149+).
| File / target | Purpose |
|------------------------------------------------------------------|------------------------------------------------------------|
| `synth/de25_nano/top_psmct32_raster_demo/files.f` | RTL filelist — Ch123 dep tree only (~14 entries). |
| `synth/de25_nano/top_psmct32_raster_demo/README.md` | Top module + macros + fixtures + DE25-Nano clock/reset/video assumptions. |
| `make top_psmct32_raster_demo_synth_check` | Validates files.f paths + fixture presence. |
The synth-check target depends on `top_psmct32_raster_demo_mem_check`,
so a single command verifies fixture sizes AND that every file
referenced by the synth filelist exists. It exits non-zero on
any miss — surfacing both fixture drift (Ch147 size guard) and
filelist drift as hard build failures.
`.qsf` (Quartus pin assignments) is **not** committed in Ch148.
The README documents the board assumptions (clock domain,
reset polarity, `core_go` strategy, video-out path candidates,
LED status mapping) so the next chapter can author it without
inventing context. The point of Ch148 is that a Quartus project
import (or Vivado / `verilator --lint-only`) finds every file
the design needs, with the macros documented end-to-end.
### DE25-Nano board wrapper (Ch149)
Ch149 turns the Ch146 board-agnostic top into a real board top
without yet committing to pin assignments or a video PHY. New:
| Artifact | Purpose |
|-----------------------------------------------------------|------------------------------------------------------------------------|
| `rtl/top/de25_nano_psmct32_raster_demo_top.sv` | Board wrapper — DE25-Nano signal names + reset sequencer + LED status. |
| `sim/tb/top/tb_de25_nano_psmct32_raster_demo_top.sv` | Smoke TB exercising clock/reset/core_go/LED/video pins. |
**Top ports** (matching the Terasic Golden_top.v conventions
from the DE25-Nano resource CD): `CLOCK0_50` / `CLOCK1_50` /
`CLOCK2_50`, `KEY[1:0]` (active-LOW), `SW[3:0]`, `LED[7:0]`
(active-LOW), and raw `VIDEO_R/G/B/HSYNC/VSYNC/DE` outputs that
a future PHY shim will consume.
**Reset bridge**:
1. `ninit_done` sourced from Terasic's `reset_release` IP under
`` `ifdef USE_TERASIC_RESET_RELEASE_IP `` (default-off; sim uses
an inline 16-cycle stub matching the IP's shape).
2. `KEY[0]` + `ninit_done` feed an async-assert/sync-deassert
2-stage shift register on CLOCK2_50. Mirrors the retroDE_nes
pattern at `retroDE_nes.sv:170-177`.
**`core_go` sequencer**: 16-cycle delay after `core_rst_n`
deasserts, then a one-cycle `core_go` pulse. Matches the
"recommended hardware path" documented in the Ch148 README and
the level-sensitive `go_i` semantics at `ee_core_stub.sv:812-813`.
**LED status**: the Ch146 wrapper's three sticky status outputs
drive `LED[2:0]` (active-LOW): `LED[0] = ~core_halt`,
`LED[1] = ~dma_done_seen`, `LED[2] = ~frame_seen`. `LED[7:3]`
tied HIGH (OFF).
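The active-LOW mapping can be summarised as a one-line bit expression. The Python sketch below is only a model of the mapping stated above; `led_bus` is a hypothetical name.

```python
def led_bus(core_halt: bool, dma_done_seen: bool, frame_seen: bool) -> int:
    """Return the 8-bit active-LOW LED value (a set bit means the LED is off)."""
    lit = (int(core_halt) << 0) | (int(dma_done_seen) << 1) | (int(frame_seen) << 2)
    return ~lit & 0xFF   # invert for active-LOW; LED[7:3] stay HIGH (off)
```

With no status latched the bus reads `0xFF` (all off); once all three flags are set it reads `0xF8`.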
**Smoke TB counts**: `core_go_pulses=1`, all three status LEDs
eventually latch (the actual fall-edge order is `frame_seen`
first, then `core_halt`, then `dma_done_seen` — `frame_seen`
is a "PCRTC alive" indicator that fires on the first empty
frame after reset, well before the bootlet runs), and
`VIDEO_DE` rises inside the active region. Standalone PASS.
`.qsf` (pin assignments), PLL, and video PHY shim remain
deferred (Ch150+). Ch149 makes the design board-shaped, not
yet board-pinned.
### Quartus scaffold for DE25-Nano (Ch150)
Ch150 commits the first real Quartus artifacts for the Ch149
board wrapper — a minimal `.qsf` + `.sdc` pair, deliberately
PHY-light:
| File | Purpose |
|-----------------------------------------------------------------|-------------------------------------------------------------------|
| `synth/de25_nano/top_psmct32_raster_demo/de25_nano_psmct32_raster_demo_top.qsf` | Device + family + pin assignments + IO standards + .mem macros + file list. |
| `synth/de25_nano/top_psmct32_raster_demo/de25_nano_psmct32_raster_demo_top.sdc` | CLOCK2_50 50 MHz clock + reset-sync false-path + IO false-paths. |
| `make top_psmct32_raster_demo_quartus_scaffold_check` | Validates both files exist + top entity + pins + clock period. |
**Device** (sourced from `retroDE_splash.qsf`): Agilex 5
`A5EB013BB23BE4SCS`, package `VPBGA`. **Top entity**:
`de25_nano_psmct32_raster_demo_top` (the Ch149 board wrapper —
NOT the inner Ch146 module). **Pin assignments** match the
DE25-Nano board pinout used by `retroDE_splash` and
`retroDE_nes`: `CLOCK2_50` → `PIN_BF23`, `KEY[0]` → `PIN_C8`,
`LED[2:0]` → `PIN_DN22 / PIN_DJ32 / PIN_DF35`. CLOCK0/1_50,
KEY[1], SW[3:0], and LED[7:3] are also assigned (their canonical
pins) so Quartus doesn't flag them as unconstrained inputs/
outputs even though the Ch149 wrapper ties them off.
**SDC** (sourced from `retroDE_splash.sdc`): a single 50 MHz
`create_clock` on CLOCK2_50, the standard reset-sync first-stage
false-path (`set_false_path -to [get_registers -nowarn
{*rst_sync[0]}]`), and IO false paths for `KEY[*]`, `SW[*]`,
`LED[*]` plus the as-yet-unpinned `VIDEO_*` outputs (replaced
by real `set_output_delay` constraints when the PHY shim
lands).
**`.mem` macros** baked into the QSF (project-relative paths):
`TOP_PSMCT32_RASTER_DEMO_BIOS_IMAGE_FILE = sim/data/top_psmct32_raster_demo/bios.mem`
and the matching payload macro. Run `make -C sim
top_psmct32_raster_demo_mem` before launching Quartus.
**`USE_TERASIC_RESET_RELEASE_IP`** is **not** defined in this
QSF — keeping the wrapper self-contained for the first project
import. To wire in Terasic's `reset_release` IP, define the
macro and add the IP file from
`DE25_Nano_ResourceCD/Demonstration/FPGA/Board_Info_RTL/reset_release/`.
**Deferred to Ch151+**: video PHY pins + shim (HDMI ADV7513 +
I²C config FSM, VGA DAC, or PMOD), PLL `.ip` config, LPDDR4 /
SDRAM / HPS / CAM / UART / GPIO assignments. Ch150 makes the
project Quartus-importable, not yet Quartus-buildable for video
output.
### PLL + lock-gated reset (Ch151)
Ch151 adds the most conservative hardware bring-up step before
touching the video PHY: a board-clock PLL on the path between
`CLOCK2_50` and the design clock, with the reset bridge gated
on PLL lock so the design can only leave reset once the PLL is
stable.
| Artifact | Purpose |
|-------------------------------------------------------|----------------------------------------------------------------------|
| `rtl/top/de25_nano_pll_stub.sv` | Sim stub matching the Quartus IOPLL `pll` module signature. |
| `rtl/top/de25_nano_psmct32_raster_demo_top.sv` (Ch151) | Reworked with PLL instantiation + lock-gated reset bridge + `design_clk` distribution to the Ch146 wrapper and `core_go` sequencer. |
| `tb_de25_nano_psmct32_raster_demo_top` (Ch151 update) | Adds rising-edge timestamps for `pll_locked` / `core_rst_n` / `core_go` and asserts the contract `pll_locked < core_rst_n < core_go`. |
**PLL signature** (matches `retroDE_nes/ip/pll/pll_bb.v` and
`retroDE_splash/ip/sys_pll/sys_pll_bb.v`):
```
module pll (
input wire refclk,
input wire rst,
output wire outclk_0,
output wire locked
);
```
**Sim stub behavior**: `outclk_0 = refclk` (pass-through, no
multiplication — sim doesn't need a different frequency, and a
pass-through still exercises the lock-gated reset bridge).
`locked` rises after 32 cycles with `rst` low; held LOW while
`rst` is HIGH.
**Reset gating**: the board top's `rst_sync` register
async-asserts on `(ninit_done | ~pll_locked)` — both FPGA init
AND PLL lock must complete before reset can deassert.
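A cycle-level Python model makes the gating behaviour concrete. It is a sketch under simplifying assumptions: the asynchronous assert is approximated at clock granularity, the function and variable names are hypothetical, and the two-stage depth is taken from the reset bridge description rather than read from the RTL.

```python
def reset_bridge(ninit_done: list, pll_locked: list) -> list:
    """Per-cycle core_rst_n for per-cycle ninit_done / pll_locked samples."""
    stage0 = stage1 = 0
    out = []
    for init_busy, locked in zip(ninit_done, pll_locked):
        if init_busy or not locked:
            stage0 = stage1 = 0          # assert: clear both stages immediately
        else:
            stage1, stage0 = stage0, 1   # deassert: shift a 1 in per clock edge
        out.append(stage1)               # core_rst_n is the last stage
    return out


# FPGA init finishes at cycle 2, the PLL locks at cycle 3.
trace = reset_bridge([1, 1, 0, 0, 0, 0], [0, 0, 0, 1, 1, 1])
```

In this trace `core_rst_n` only rises at cycle 4, two clock edges after both conditions have cleared, and any later loss of lock drops it back to 0 at once.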
**Synth swap**: define `USE_PLL_IP` and add a Quartus IOPLL
`.qip` to the project; the board wrapper's `` `ifdef USE_PLL_IP ``
swaps the stub for the real IP. The QSF documents the swap
mechanism but ships with the IP commented out, keeping the
scaffold self-contained until the PLL chapter (Ch152+) commits
a frequency choice + IP file.
**TB contract** (smoke output): `t_pll/rstn/go=(950000,990000,
1330000)` ps. The PLL locks at 950 ns, reset deasserts 40 ns
later (two 20 ns clock periods through the 2-stage sync
register), and `core_go` fires 340 ns after that (17 clock
periods, covering the GO_DELAY=16 wait). Order assertions catch
any future regression of the gating.
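The three timestamps can be cross-checked against the clock period. The sketch below treats the tuple as picoseconds (which is what the 950 ns / 40 ns / 340 ns figures imply) and assumes the 50 MHz `CLOCK2_50` passes straight through the PLL stub; the helper names are hypothetical.

```python
CLK_PERIOD_PS = 20_000   # 50 MHz clock, pass-through PLL stub

# Timestamps from the smoke output, in picoseconds.
t_pll, t_rstn, t_go = 950_000, 990_000, 1_330_000


def cycles_between(t_start_ps: int, t_end_ps: int) -> int:
    """Whole clock periods between two edge timestamps."""
    delta = t_end_ps - t_start_ps
    if delta % CLK_PERIOD_PS:
        raise ValueError("timestamps are not a whole number of clock periods apart")
    return delta // CLK_PERIOD_PS


def gating_order_ok(t_pll: int, t_rstn: int, t_go: int) -> bool:
    """The ordering contract the TB asserts."""
    return t_pll < t_rstn < t_go
```

`cycles_between(t_pll, t_rstn)` is 2 and `cycles_between(t_rstn, t_go)` is 17, in line with a 2-stage synchronizer followed by the 16-cycle `core_go` delay.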
**Deferred to Ch152+**: real PLL output frequency tuning (the
stub passes refclk through; a real build sets `outclk_0` to
whatever the video PHY chapter needs), committing the actual
IOPLL `.ip` file under `synth/de25_nano/.../ip/`, the video
PHY shim itself.
### First Quartus compile + baseline report (Ch152)
Ch152 is the chapter where the toolchain is finally asked the
honest question: "does this DE25-Nano board top synthesize, fit,
and pass static timing analysis?"
**Driver**: [`synth/de25_nano/top_psmct32_raster_demo/build_quartus.sh`](../../synth/de25_nano/top_psmct32_raster_demo/build_quartus.sh)
runs `quartus_syn → quartus_fit → quartus_sta` against the Ch150
QSF + Ch151 PLL stub. `quartus_asm` (bitstream gen) is
deliberately skipped — Ch152 is a compile-and-report smoke,
not a deploy path. `USE_PLL_IP` is left undefined so the Ch151
self-contained PLL stub stays under test (per Codex framing).
**Make targets**:
| Target | Action |
|---------------------------------|-------------------------------------------------------------|
| `make quartus_compile` | Full syn + fit + sta flow. |
| `make quartus_compile_clean` | Wipe outputs first, then full flow. |
| `make quartus_syn_only` | Synthesis only (~14 min smoke). |
| `make quartus_compile_report` | Run [`parse_reports.py`](../../synth/de25_nano/top_psmct32_raster_demo/parse_reports.py) on the latest output. |
**Ch152 RTL fixes that landed before synthesis would even
elaborate**:
| Issue | Fix |
|------------------------------------------------------------------------------------|------------------------------------------------------------------------------|
| QSF line-continuation (`\`) parse error in `set_global_assignment -name VERILOG_MACRO` | Collapsed each assignment onto a single line. |
| `vram_stub.mem` 8192-iter init loop exceeded Quartus's 5000-iter synthesizable-loop limit (Error 13356) | Wrapped initial block in `// synthesis translate_off` / `_on` pragmas. Real Altera/Intel BRAM is power-on-zero so the procedural loop is sim-only. |
| `gs_pcrtc_stub` / `gif_image_xfer_stub` / `gs_stub` unconditionally instantiate all four swizzle math primitives even when their gate is 0 | Added `gs_swizzle_psmct16/8/4_stub.sv` to the synth filelist + QSF (iverilog trimmed silently; Quartus errors). |
| `gs_stub.interp_byte` (Ch86 Gouraud TRI math) 64-bit signed divide hits Quartus Pro's lpm_divide LPM_WIDTHN ≤64 limit (Error 272006) | Wrapped divide in `// synthesis translate_off`; default fallback returns 0. The Ch123 SPRITE-only demo doesn't exercise Gouraud TRIs, so this is dead code in the build. A future Gouraud-TRI hardware demo would need a divider redesign sized for Agilex 5. |
| QSF `SDC_FILE` referenced via repo-root-relative path failed when the build script ran Quartus from a per-build work dir (Warning 16124) | Changed to basename-only — works from either the repo root or the work dir (the script symlinks the SDC alongside the QSF). |
**First successful synthesis**: 0 errors, 3 warnings, 14:08
elapsed. 160 RAM segments + 26 DSP elements inferred.
**Fitter result — design too large for the part (the chapter's
honest answer)**:
```
Total dedicated logic registers : 121,176
Total pins : 17 / 351 ( 5 %)
Total block memory bits : 65,536 / 7,331,840 (<1 %)
Total RAM Blocks : 6 / 358 ( 2 %)
Total DSP Blocks : 20 / 188 (11 %)
Logic utilization (ALMs needed) : 155,104 / 46,800 (331 %)
```
The design needs **155,104 ALMs vs the part's 46,800 — 3.31×
oversized**. `Error (170011): Design contains 260,263 blocks of
type combinational node. However, the device contains only
93,600 blocks.`
**Why so big** (the precise picture, to be drilled into by Ch153+):
The synthesis log reports `Info (22567): extracting RAM` for
**all four** memory identifiers — `ee_ram_stub.mem`,
`bios_rom_stub.mem`, `ee_memory_map_stub.useg_shadow_mem`, and
`vram_stub.mem` — so Quartus *did* recognize each as a memory
structure at syn time. But the fit report shows only **65,536
bits / 6 RAM Blocks** committed (roughly enough for BIOS 4 KB +
EE-RAM 4 KB). Something between syn and fit caused the larger
arrays — most likely `vram_stub.mem` (8 KB) and possibly
`useg_shadow_mem` (4 KB after Ch145's 1024-word shrink) — to
either (a) be replicated into combinational mux/decoder logic
because of their access-port shape, or (b) lose their RAM
attribute during fitter optimization and fall back to
flip-flop implementation. The 121,176 dedicated registers + the
260,263 combinational nodes are consistent with at least
`u_vram` getting massively unrolled.
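The memory budget behind that reasoning is quick to reproduce. The sketch below uses only the sizes quoted in this chapter; the variable names are hypothetical, and treating VRAM plus useg-shadow as the arrays that missed block RAM is the chapter's working hypothesis, not a confirmed result.

```python
KIB = 1024


def bits(num_bytes: int) -> int:
    """Convert a byte count to a bit count."""
    return num_bytes * 8


committed_bits = 65_536                            # from the fit report
bios_plus_ee_ram = bits(4 * KIB) + bits(4 * KIB)   # BIOS 4 KB + EE-RAM 4 KB
vram_bits = bits(8 * KIB)                          # vram_stub.mem
useg_shadow_bits = bits(4 * KIB)                   # useg_shadow_mem after Ch145
suspect_bits = vram_bits + useg_shadow_bits        # arrays suspected to miss BRAM

alm_ratio = 155_104 / 46_800                       # ALMs needed vs available
```

The committed 65,536 bits are exactly accounted for by BIOS plus EE-RAM, which leaves 98,304 bits of suspect storage unaccounted for in block RAM, and `alm_ratio` comes out at roughly 3.31.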
Ch153's job is to isolate **which array(s)** and **which port
shape(s)** prevent compact block-RAM implementation. The
likely candidates: `vram_stub`'s dual read ports + per-byte
write_be lane (Ch95's per-byte gate may not be RAM-block-
friendly on Agilex 5), and the EE memory map's wide arbitration
into the useg-shadow port. None of this is fixed in Ch152 —
surfacing the gap precisely is the chapter's deliverable.
**Other notable findings** (full list in
[`output_files/build_logs/`](../../synth/de25_nano/top_psmct32_raster_demo/output_files/build_logs/)):
- **Critical Warning 20759**: "Use the Reset Release IP in
Agilex 5 designs to ensure a successful configuration." This
is the Ch151 `` `ifdef USE_TERASIC_RESET_RELEASE_IP `` opt-in;
enabling it (and committing the IP file) is a Ch153+ task.
- **6× Warning 16749**: identifiers used before declaration in
`dmac_reg_stub`, `gif_packed_stub`, `gs_stub`,
`gif_image_xfer_stub`. Style/lint warnings, no functional
impact; clean-up candidate for a future polish chapter.
- **STA never ran** because fit failed.
**What Ch152 leaves for Ch153+**:
- Resource reduction. Most likely candidates: BRAM-infer
`vram_stub.mem` and `useg_shadow_mem` cleanly (Quartus
attribute hints / restructure read ports), or shrink the EE
core's MIPS decode (table-driven vs LUT-driven), or move to
a larger Agilex 5 part if available.
- Enabling `USE_TERASIC_RESET_RELEASE_IP` and committing the
Terasic `reset_release` IP file.
- The PHY shim chapter (`VIDEO_*` virtualized → real HDMI
ADV7513 / VGA / PMOD pins).
- Cleaning up the 6× forward-reference style warnings.
### Memory-shape forensics (Ch153)
Ch153 is a memory-forensics chapter (NOT a rewrite chapter): two
isolated tiny Quartus projects under [`synth/de25_nano/experiments/`](../../synth/de25_nano/experiments/)
target the same Agilex 5 part as the Ch150 board top so resource
deltas are apples-to-apples. The goal is to identify which feature(s)
of `vram_stub`'s shape prevent compact block-RAM implementation and
drive the Ch152 size deficit.
| Experiment | Memory shape |
|-----------------------|-----------------------------------------------------------------------------------------------|
| `exp_a_bram_friendly` | 2048 × 32-bit, single port, sync read + sync write with byte-WE. Intel-friendly BRAM template. |
| `exp_b_vram_shape` | 8192 × 8-bit, dual COMBINATIONAL read, byte-WE + per-bit mask RMW. Exact `vram_stub` shape. |
**The result is decisive**:
| Metric | exp_a (BRAM-friendly) | exp_b (vram_stub-shape) |
|---------------------------------|-----------------------|-------------------------|
| Fitter status | ✅ **Successful** | ❌ **Failed** |
| Logic utilization (ALMs) | **46** / 46,800 (< 1 %) | (fit failed — placement reports 257,986 combinational nodes vs 93,600 device max) |
| Total dedicated logic registers | **0** | **65,536** |
| Total RAM Blocks | **4** / 358 (1 %) | **0** / 358 (0 %) |
| Total block memory bits | **65,536** (8 KB) | **0** |
**Interpretation**:
- The Intel-friendly shape maps the same 8 KB to **4 RAM Blocks**
with **only 46 ALMs of surrounding logic and zero dedicated
registers** beyond the read-output flop.
- The `vram_stub` shape maps the same 8 KB to **zero RAM Blocks**,
**65,536 dedicated registers** (one flip-flop per bit: 8192
bytes × 8 bits), and **257,986 combinational nodes** (the 4-byte
concatenation multiplexers for the dual combinational reads +
the per-bit mask RMW gates).
- The 257,986 combinational-node figure for a single 8 KB memory
almost exactly matches the 260,263 combinational-node figure
Ch152 reported for the **entire top-wrapper design** —
empirical confirmation that `u_vram` alone accounts for
essentially all of the Ch152 size deficit.
**Which feature is the dominant cost** (the four candidates the
shape diff bundles together):
The exp_a vs exp_b diff folds four feature changes together
(byte-addressable storage, combinational reads, dual reads,
per-bit mask RMW). To pin down which feature(s) dominate, a
future chapter could insert intermediate experiments — but the
exp_a result already gives the upper bound on what BRAM-native
inference can buy: ~4 RAM blocks + ~50 ALMs for 8 KB. Anything
that gets `vram_stub` close to that bar wins back the entire
Ch152 fit headroom.
The most likely individual culprit is the **per-bit mask RMW**:
Agilex 5's M20K BRAM has byte-WE primitives but does NOT have
per-bit RMW. Quartus has to materialize the
`(mem & ~mask) | (data & mask)` arithmetic outside the BRAM,
which forces the storage out of BRAM and into per-bit flip-flops.
Combinational reads are the second most likely (BRAMs are
synchronous-read-only on Agilex 5; Quartus has to either insert
a register on the read path or materialize the storage as
discrete flip-flops to feed the comb output).
**Make targets**:
| Target | Action |
|---------------------------------------|--------------------------------------------------------------|
| `make quartus_experiments` | Compile every `synth/.../experiments/exp_*` project. |
| `make quartus_experiments_clean` | Wipe outputs first, then compile. |
| `make quartus_experiments_report` | Side-by-side resource summary (no recompile). |
**What Ch153 leaves for Ch154+**:
- Refactor `vram_stub` into a BRAM-friendly shape: replace
combinational reads with sync (registered output) reads,
replace per-bit mask RMW with byte-WE-only writes (move the
per-pixel sub-byte merging logic into the writer module —
most likely `gs_stub.raster_pixel_emit` for the PSMT4 nibble
case), and switch to 32-bit word-addressable storage with
byte-WE for the unaligned-byte case.
- Audit `useg_shadow_mem` next — it had `Info (22567): extracting RAM`
at synthesis but didn't survive to fit. Likely culprits there:
the `Ch64` / `Ch65` / `Ch70` mirror-write features that turn
the simple useg-shadow into a multi-port write structure.
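The proposed writer-side splice can be illustrated with a small Python model. It is a sketch of the idea only: `WordRam` and `writer_side_nibble_write` are hypothetical names, the memory is modelled with zero read latency (the real sync-read BRAM would need an extra cycle for the read), and none of this is taken from the eventual RTL.

```python
class WordRam:
    """2048 x 32-bit storage with byte write enables only (no per-bit mask)."""

    def __init__(self, words: int = 2048) -> None:
        self.mem = [0] * words

    def read(self, word_addr: int) -> int:
        return self.mem[word_addr]

    def write(self, word_addr: int, data: int, be: int) -> None:
        """Update only the byte lanes whose enable bit is set."""
        for lane in range(4):
            if (be >> lane) & 1:
                lane_mask = 0xFF << (8 * lane)
                self.mem[word_addr] = (self.mem[word_addr] & ~lane_mask) | (data & lane_mask)


def writer_side_nibble_write(ram: WordRam, byte_addr: int,
                             nibble: int, nibble_hi: bool) -> None:
    """Writer does the read-merge itself, then issues a plain byte-WE write."""
    word_addr, lane = divmod(byte_addr, 4)
    old_byte = (ram.read(word_addr) >> (8 * lane)) & 0xFF
    if nibble_hi:
        new_byte = (old_byte & 0x0F) | ((nibble & 0xF) << 4)
    else:
        new_byte = (old_byte & 0xF0) | (nibble & 0xF)
    ram.write(word_addr, new_byte << (8 * lane), 1 << lane)
```

The memory itself never sees a per-bit mask: the nibble merge happens in the writer, and the RAM only receives a whole byte with a single byte-enable bit set.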
### BRAM-friendly vram sibling (Ch154)
Ch154 adds a hardware-friendly sibling of `vram_stub` —
[`rtl/gif_gs/vram_bram_stub.sv`](../../rtl/gif_gs/vram_bram_stub.sv) — that maps cleanly onto Agilex 5
M20K block-RAM. Per Codex's framing, the chapter's blast radius
stays narrow: **add the sibling + prove it works + measure the
BRAM-inference win**. The actual swap of the board top to use
the new module + the writer-side PSMT4 nibble-RMW rework lands
in Ch155+.
**`vram_bram_stub` shape vs `vram_stub`**:
| Feature | `vram_stub` (legacy / sim reference) | `vram_bram_stub` (Ch154, hw-friendly) |
|----------------------------|-------------------------------------|----------------------------------------|
| Storage | 8192 × 8-bit byte-addressable | 2048 × 32-bit word-addressable |
| Reads | Combinational; arbitrary alignment | Synchronous (1-cycle); word-aligned only |
| Read ports | 2 (combinational) | 2 (sync, true dual-port M20K) |
| Write granularity | byte-WE + per-bit `write_mask` RMW | byte-WE only |
| Per-bit mask RMW (Ch106) | yes — supports PSMT4 nibble splice | NO — caller must splice on writer side |
**New equivalence TB**: [`tb_vram_bram_stub_equivalence`](../../sim/tb/gif_gs/tb_vram_bram_stub_equivalence.sv).
Drives both DUTs in lockstep with byte-WE-only writes
(`write_mask = 0xFFFFFFFF` on the legacy module so the per-bit
RMW path is a no-op), aligns sample times across the new
module's 1-cycle sync-read latency, and asserts data
equivalence across:
- 32-bit word writes (`be=4'b1111`)
- per-byte-lane writes (`be=4'b0001 / 0010 / 0100 / 1000`)
- per-byte non-wrapping admission near MAX_BASE
- dual-port read agreement
PASS standalone + in the full sim regression.
**Quartus experiment `exp_c_vram_bram_stub`** ([synth/.../experiments/exp_c_vram_bram_stub/](../../synth/de25_nano/experiments/exp_c_vram_bram_stub/))
proves the new module infers BRAM cleanly. Side-by-side with
the Ch153 baselines, all on the same Agilex 5 part:
| Experiment | Fit | ALMs | Registers | RAM Blocks | Block memory bits |
|------------------------|-----------|------|-----------|------------|-------------------|
| `exp_a_bram_friendly` | ✅ Success | **46** / 46,800 | **0** | **4** / 358 | 65,536 |
| `exp_b_vram_shape` | ❌ Failed | (261,578 comb nodes vs 93,600 device max) | **65,536** | **0** / 358 | 0 |
| `exp_c_vram_bram_stub` | ✅ Success | **190** / 46,800 | **2** | **8** / 358 | 131,072 |
**Interpretation**:
- `exp_c` lands close to `exp_a`'s ideal (190 vs 46 ALMs; 8 vs
4 RAM Blocks). The overhead vs `exp_a` comes from the dual
read port (the storage is replicated so that two independent
read addresses can be served simultaneously, hence 2× block
memory bits) plus the Ch95 per-byte non-wrapping admission
gate inherited from `vram_stub`.
- `exp_c` uses **16× fewer** dedicated registers than it would
have if `read_data` were reset (2 vs the 32 a reset would
require): the canonical Quartus inference template demands
no reset on the BRAM data register.
- vs `exp_b`'s **65,536 registers + 261,578 combinational nodes**,
swapping `vram_stub` → `vram_bram_stub` recovers essentially
all of the Ch152 ALM headroom on the vram side. Useg-shadow
is the next forensic target (likely similar shape).
**Inference template gotcha** (caught + fixed in this chapter):
the first cut of `vram_bram_stub` had a reset on `read_data`
inside the always_ff block AND an in-bounds gate guarding the
`mem` read. Quartus rejected BRAM inference with
`Info (276007): RAM logic ... uninferred due to asynchronous
read logic`. Fix: simplified the read path to the canonical
template (`always_ff @(posedge clk) read_data <= mem[idx];`)
and moved bounds + alignment checks to a parallel `read_valid`
pipeline. After the fix, Quartus reported `Implemented 64 RAM segments` instead of 0.
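
The fixed read-path shape can be illustrated with a small behavioural model. This is a Python sketch, not the RTL: `BramModel` and `tick` are hypothetical names, and the model assumes a single read port, 2048 × 32-bit storage, and read-before-write behaviour on a same-edge collision. The point it shows is that the memory read is unconditional, while bounds and alignment travel in a parallel `read_valid` flag.

```python
class BramModel:
    """Cycle-level model of a word-addressable RAM with a registered read port.

    Hypothetical sketch: read_data <= mem[idx] is unconditional (as in the
    canonical inference template); bounds/alignment go to read_valid instead.
    """
    DEPTH_WORDS = 2048  # 2048 x 32-bit = 8 KiB

    def __init__(self):
        self.mem = [0] * self.DEPTH_WORDS
        self.read_data = 0       # registered output (no reset in the template)
        self.read_valid = False  # parallel pipeline for bounds + alignment

    def _in_range(self, byte_addr):
        return 0 <= byte_addr < self.DEPTH_WORDS * 4

    def tick(self, read_addr, write_en=False, write_addr=0,
             write_data=0, write_be=0):
        """Advance one clock edge. All addresses are byte addresses."""
        # The read port samples memory as it was BEFORE this edge's write.
        next_data = self.mem[(read_addr >> 2) % self.DEPTH_WORDS]
        next_valid = (read_addr & 3) == 0 and self._in_range(read_addr)
        # Write port: word-aligned addresses only; byte enables pick lanes.
        if write_en and (write_addr & 3) == 0 and self._in_range(write_addr):
            word = self.mem[write_addr >> 2]
            for lane in range(4):
                if (write_be >> lane) & 1:
                    mask = 0xFF << (8 * lane)
                    word = (word & ~mask) | (write_data & mask)
            self.mem[write_addr >> 2] = word & 0xFFFFFFFF
        self.read_data, self.read_valid = next_data, next_valid
```

A caller issues `tick(read_addr=A)` and consumes `read_data` only after that call returns, which corresponds to the 1-cycle sync-read latency the equivalence TB aligns against.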
**Ch155+ surface — writer-side normalization for ALL sub-32-bit
PSMs, not just PSMT4**: `vram_bram_stub`'s contract is stricter
than `vram_stub`'s — `write_addr` MUST be word-aligned
(`write_addr[1:0] == 2'b00`), and the byte lane(s) being written
are selected via `write_be` with the payload pre-shifted into
the right byte lane(s) of `write_data[31:0]`. Today's writer-
side RTL emits at sub-word boundaries:
- **PSMCT16** raster + image-xfer write at halfword addresses
(`write_addr[1] == 1` for the high halfword) with `be=4'b0011`
or `4'b1100` and the 16-bit payload in `write_data[15:0]`.
- **PSMT8** raster + image-xfer write at byte addresses
(any `write_addr[1:0]`) with `be=4'b0001` and the 8-bit payload
in `write_data[7:0]`.
- **PSMT4** raster + image-xfer write at byte addresses with
`be=4'b0001` + per-bit `write_mask` 0x0F or 0xF0 to splice
one nibble.
- **PSMCT32** raster + image-xfer write at word addresses with
`be=4'b1111` + the full 32-bit payload — the ONLY PSM that
natively matches `vram_bram_stub`'s contract today.
If we swap the board top to `vram_bram_stub` without writer-side
normalization, **CT16/T8/T4 writes silently drop** because
`write_addr[1:0] != 0` fails admission. So Ch155 must rework
each writer to:
1. Mask `write_addr` down to its word base (`write_addr & ~32'd3`).
2. Shift the payload from its native byte lane into the
appropriate byte lane(s) of a 32-bit `write_data` based on
the original `write_addr[1:0]`.
3. Generate `write_be` with bits set only for the byte lanes
the original sub-word address actually targets.
4. **For PSMT4 specifically**: replace the per-bit `write_mask`
nibble splice with a writer-side read-modify-write — read
the existing byte first, splice the new nibble in, then
issue a normal byte-WE write. Adds ~1 cycle of latency per
nibble-write but that's well within the 16×8 demo budget.
The rework lands inside `gs_stub.raster_pixel_emit` (Ch95/Ch105/
Ch106 wrote the legacy paths) and `gif_image_xfer_stub`'s per-
PSM dispatch. A focused TB that drives sub-word writes through
the normalizer and asserts the resulting `vram_bram_stub` words
match the legacy `vram_stub` byte-/halfword-/nibble-level
state would be the cleanest proof.
**Other Ch155+ work**:
- Update scanout / debug TBs that sample VRAM via vram_stub's
combinational reads to handle the 1-cycle sync-read latency
(or keep them on `vram_stub` if they're sim-only).
- Swap the Ch146 board top to instantiate `vram_bram_stub`
AFTER the writer-side normalization lands. Rerun the full
Quartus compile and expect a dramatic ALM/register reduction.
- Audit `useg_shadow_mem` next — Ch64/Ch65/Ch70 mirror-write
features may make it multi-port-write-shaped.
### VRAM write normalizer + first BRAM integration (Ch155)
Ch155 lands the writer-side normalization layer that bridges
the contract gap between the legacy `vram_stub` (byte-addressed
sub-word writes + per-bit RMW) and the new `vram_bram_stub`
(word-aligned + byte-WE only). Per Codex's framing the chapter
keeps blast radius narrow: build the normalizer + verify it
standalone for all 4 PSMs + prove the easiest case (PSMCT32)
end-to-end through the new VRAM. RTL plumbing into
`gs_stub.raster_pixel_emit` and `gif_image_xfer_stub` lands in
Ch156+.
| Artifact | Purpose |
|------------------------------------------------------------|------------------------------------------------------------------|
| `rtl/gif_gs/vram_normalize_pkg.sv` | Pure-comb `normalize_write` function — natural byte address + PSM + payload + (T4-only) old_byte → word-aligned write_addr + shifted write_data + write_be. |
| `tb_vram_normalize_write` | Focused unit TB — 17 cases across CT32 / CT16 / T8 / T4 lanes + misuse detection. |
| `rtl/top/top_psmct32_raster_demo_bram.sv` | Sibling of the Ch146 wrapper with `vram_bram_stub` swapped in. |
| `tb_top_psmct32_raster_demo_bram` | Integration TB — drives Ch146 fixtures + verifies VRAM contents at PSMCT32 swizzled addresses via hierarchical probe. |
**Function contract** (`vram_normalize_pkg::normalize_write`):
| PSM | byte_addr alignment | payload bits used | output `write_be` shape | extras |
|-----------|---------------------|-------------------|-------------------------|--------|
| PSMCT32 | word (`addr[1:0]==0`) | `payload[31:0]` (full ABGR) | `4'b1111` | misuse → drop (`be=0000`) |
| PSMCT16 | halfword (`addr[0]==0`) | `payload[15:0]` (RGB5A1) | `4'b0011` (low) / `4'b1100` (high), keyed on `addr[1]` | misuse → drop |
| PSMT8 | byte (any) | `payload[7:0]` (index byte) | one of `4'b0001 / 0010 / 0100 / 1000`, keyed on `addr[1:0]` | — |
| PSMT4 | byte (any) | `payload[3:0]` (nibble) | one of `4'b0001 / 0010 / 0100 / 1000`, keyed on `addr[1:0]` | needs `old_byte` + `nibble_hi`; output is the spliced full byte at the addressed lane |
| any other | — | — | `4'b0000` | — |
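
The contract table can be read as the following Python reference model. It is a sketch under stated assumptions, not a transcription of `vram_normalize_pkg`: the PSM constants use the standard GS encodings, the argument order is invented for illustration, and returning `write_data = 0` on a dropped write is an assumption (the table only fixes `write_be`).

```python
# Standard GS pixel-storage-mode encodings (constant names are illustrative).
PSMCT32, PSMCT16, PSMT8, PSMT4 = 0x00, 0x02, 0x13, 0x14

def normalize_write(byte_addr, psm, payload, old_byte=0, nibble_hi=False):
    """Map a natural byte-addressed write to (write_addr, write_data, write_be)
    for a word-aligned, byte-write-enable RAM."""
    word_addr = byte_addr & ~3
    lane = byte_addr & 3
    if psm == PSMCT32:
        if lane != 0:
            return word_addr, 0, 0b0000          # misuse: drop
        return word_addr, payload & 0xFFFFFFFF, 0b1111
    if psm == PSMCT16:
        if byte_addr & 1:
            return word_addr, 0, 0b0000          # misuse: drop
        hi = (byte_addr >> 1) & 1                # addr[1] picks the halfword
        return word_addr, (payload & 0xFFFF) << (16 * hi), 0b1100 if hi else 0b0011
    if psm == PSMT8:
        return word_addr, (payload & 0xFF) << (8 * lane), 1 << lane
    if psm == PSMT4:
        nib = payload & 0xF
        # Splice the nibble into the caller-supplied old byte.
        new_byte = (nib << 4) | (old_byte & 0x0F) if nibble_hi else (old_byte & 0xF0) | nib
        return word_addr, new_byte << (8 * lane), 1 << lane
    return word_addr, 0, 0b0000                  # unknown PSM: drop
```

For example, a PSMCT16 write of `0x6155` at byte address `0x106` becomes a write to word `0x104` with the payload in bits `[31:16]` and `write_be = 4'b1100`.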
**PSMT4 splice math** (the only PSM whose output depends on
prior memory state): given `nibble_hi=0`, the function returns
`new_byte = {old_byte[7:4], payload[3:0]}` — preserves the
upper nibble, replaces the lower. With `nibble_hi=1`,
`new_byte = {payload[3:0], old_byte[3:0]}`. The CALLER is
responsible for sourcing `old_byte` via a 1-cycle read of
`mem[byte_addr]` upstream of the write; the function itself is
purely combinational. The Ch156+ RTL plumbing chapter is
where that read pipeline lives inside
`gs_stub.raster_pixel_emit` and `gif_image_xfer_stub`.
**`top_psmct32_raster_demo_bram` integration result**: the new
sibling wrapper substitutes `vram_bram_stub` for `vram_stub`,
drops `write_mask` wiring (CT32's `mask=0xFFFFFFFF` makes the
per-bit RMW path a no-op so dropping it is functionally
equivalent), and accepts the 1-cycle sync-read latency on
PCRTC's `vram_read_data` path (so PCRTC scanout is 1-pixel
shifted; the integration TB skips frame capture and verifies
VRAM content via direct hierarchical probe). All 128 pixel
words at canonical PSMCT32 swizzled addresses match expected
ABGR. Standalone PASS.
**Ch155 critical audit check**: `normalize_write`'s
function-level misuse handling pins the contract: passing an
unaligned `byte_addr` for CT32 OR CT16 returns `write_be=4'b0000`,
which `vram_bram_stub` then drops cleanly. Combined with
Codex's stance that "no sub-32-bit writer is allowed to hand
an unaligned address directly to vram_bram_stub", the Ch156+
plumbing chapter has a hard contract to verify against.
**Ch156+ surface**:
- Insert a 1-cycle byte-read pipeline upstream of the PSMT4
raster emit + image-xfer paths inside `gs_stub` and
`gif_image_xfer_stub`. The read returns `old_byte` for
`normalize_write`'s splice input.
- Apply `normalize_write` to all four PSM emit lanes inside
both writers.
- Add focused TBs for PSMCT16 / PSMT8 / PSMT4 paths analogous
to `tb_top_psmct32_raster_demo_bram` — each verifies the
swizzled VRAM contents under the new normalizer + bram_stub.
- Add a 1-cycle address-stage register inside
`gs_pcrtc_stub` so scanout consumers see a clean
combinational-look read (`addr` → `data` with the BRAM's
internal sync stage hidden).
- Once all four lanes pass, swap the Ch146 board top to use
`vram_bram_stub` directly (or retire `vram_stub` outright).
- Audit `useg_shadow_mem` next — the Ch64/Ch65/Ch70 mirror-
write features may make it multi-port-write-shaped, which
is its own forensic exercise.
### Writer-side normalize plumbing — CT16 + T8 (Ch156)
Ch156 plumbs the Ch155 `vram_normalize_pkg::normalize_write`
function into the BRAM-friendly path so PSMCT16 and PSMT8
raster emits land at the right `vram_bram_stub` byte lane. The
chapter intentionally keeps blast radius narrow — the function
is wired in at the **wrapper site** between the unmodified
writer engines (`gs_stub.raster_pixel_emit`) and
`vram_bram_stub`, so the legacy byte-addressable contract on
`gs_stub`'s raster emit ports stays exactly as Ch128, Ch134, and
later chapters defined it. PSMT4 still requires the read-modify-write
pipeline and is deferred to Ch157+.
| File / target | Role |
| ---------------------------------------------------------- | ----------------------------------------------------------------------------------------------------- |
| `rtl/top/top_psmct32_raster_demo_bram.sv` | Wrapper updated: `raster_pixel_psm_q` exposed; `bitbltbuf_q[61:56]` provides the PSM during xfer; the muxed `(byte_addr, psm, payload)` triple is run through `vram_normalize_pkg::normalize_write` and the result feeds `vram_bram_stub`. CT32 path remains a passthrough; CT16/T8 paths now write to the right lane. |
| `tb_gs_raster_bram_psmct16` | Focused CT16 integration TB — 16×4 SPRITE at FBP=0/FBW=1, halfword 0x6155. Drives gs_stub#(PSMCT16_SWIZZLE=1) directly; verifies all 64 swizzled halfwords land in `u_vram.mem[byte_addr >> 2]` at the addr[1]-keyed lane; pins the linear-stride separator at byte 0x80 = zero. |
| `tb_gs_raster_bram_psmt8` | Focused PSMT8 integration TB — 16×8 SPRITE at FBP=0/FBW=2, byte index 0xA5. Drives gs_stub#(PSMT8_SWIZZLE=1) directly; verifies all 128 swizzled bytes land in `u_vram.mem[byte_addr >> 2]` at the addr[1:0]-keyed lane. |
**Why wrapper-site, not in-engine**: keeping `gs_stub` and
`gif_image_xfer_stub` byte-addressable preserves the contract
that every Ch128 / Ch134 / Ch140 swizzle TB (and the legacy
`vram_stub`) was written against. Ch156's only structural
change is that a top wrapper which targets `vram_bram_stub`
also runs `normalize_write` between the writer and VRAM. A
future chapter can promote the normalizer into the writer
engines once we've decided to retire `vram_stub`; until then
the function lives where it can be removed without changing
the writers.
**PSMT4 deferral — explicit hard-gate** (Ch156 audit Medium #1
fix; **superseded by Ch157**): when Ch156 closed, the wrapper
masked `write_en` off when the active PSM was PSMT4
(`vram_psmt4_block = (vram_psm_pre == PSM_PSMT4)`,
`vram_we_mux = vram_we_pre && !vram_psmt4_block`). Without that
gate, `normalize_write`'s PSMT4 branch returned a real one-byte
write spliced against `old_byte=0`, silently corrupting VRAM
on any T4 raster emit. The Ch156 focused TB
`tb_gs_raster_bram_psmt4_gate` drove a 16×4 PSMT4 SPRITE
through the wrapper-shape gate and asserted (1) raster_pixel_emit
pulses fired, (2) every pulse hit the gate (`blocked == emit`),
(3) VRAM stayed at sentinel 0xDEADBEEF — zero corruption.
**Ch157 retires both the gate and that TB**: the wrapper now
runs a real RMW pipeline (see "PSMT4 RMW pipeline" section
below) and supplies a live `old_byte` so the splice produces
correct bytes. The retired TB's coverage is replaced by
`tb_gs_raster_bram_psmt4`, which drives the same kind of PSMT4
SPRITE but verifies *correct* nibble splices instead of
*absence* of writes.
**Adversarial coverage on the CT16 / PSMT8 TBs** (Ch156 audit
Medium #2 fix): both TBs originally drove a single uniform
payload across the whole sprite, so a buggy normalizer that
wrote all four byte lanes (or duplicated payload, or stomped
neighboring lanes) could still leave every checked pixel
matching. The TBs now split the image into TWO half-width
SPRITEs with **distinct** payloads:
- `tb_gs_raster_bram_psmct16` drives `(0,0)..(7,3)` with
halfword 0x6155 (low halfword lane via PSMCT16 swizzle) and
`(8,0)..(15,3)` with halfword 0x9F8E (high halfword lane of
the same 32-bit words). Sentinel preload (0xDEADBEEF) on
every VRAM word before the drive plus a linear-stride
separator check at byte 0x80 (outside the swizzled set).
- `tb_gs_raster_bram_psmt8` drives `(0,0)..(7,7)` with byte
0xA5 (lanes {0,1}) and `(8,0)..(15,7)` with byte 0x5A
(lanes {2,3}). Same sentinel preload.
A normalizer that swaps lanes, sets be too wide, or fails to
preserve the other halfword/byte lane(s) of the shared word
now surfaces as a per-pixel mismatch.
**Sim regression**: 141 PASS / 0 FAIL after the audit fixes
(140 + the new `tb_gs_raster_bram_psmt4_gate`).
**xfer-side coverage**: `gif_image_xfer_stub` already feeds
the wrapper's pre-normalize mux during `xfer_busy`. CT32
TRXDIR uploads (no Ch156 TB exists yet, but the path is
wired) pass through the normalizer cleanly because xfer
emits CT32 word-aligned. CT16 + T8 xfer TBs that exercise
this path are a follow-on item — the wiring is already in
place; only a focused TB is missing.
**Sim regression**: 140 PASS / 0 FAIL at the initial Ch156 close
(138 + 2 new BRAM-integration TBs), before the audit fixes added
`tb_gs_raster_bram_psmt4_gate`.
### PSMT4 RMW pipeline — `vram_bram_stub` writes enabled (Ch157)
Ch157 closes the last writer-PSM gap that Ch156 left behind: the
PSMT4 hard-gate is replaced by a wrapper-site read-modify-write
pipeline that supplies a LIVE `old_byte` from VRAM, splices the
new nibble against it, and commits a full-byte write through
`vram_bram_stub`'s byte-WE (no per-bit RMW required). The
nibble splice itself uses the SAME math as `vram_normalize_pkg`'s
PSMT4 branch (`new = nibble_hi ? {nib, old[3:0]} : {old[7:4], nib}`)
but lives **inline in the wrapper**, not inside a call to
`normalize_write` — the function is pure-comb and would have
required `old_byte` to be combinationally available, whereas
`vram_bram_stub`'s registered read port hands the byte back one
cycle later. The CT32/CT16/T8 paths still call `normalize_write`
directly (same-cycle, no read-back required). The goal, as Codex framed it:
"all writer PSMs safe before swapping the board top."
**Pipeline shape** (inside
[`rtl/top/top_psmct32_raster_demo_bram.sv`](../../rtl/top/top_psmct32_raster_demo_bram.sv)):
```
emit cycle N: is_t4_emit=1; vram_read2_addr = byte_addr & ~3;
pipe_q <= (byte_addr, nibble_hi, nibble[3:0]).
posedge → cycle N+1: vram_read2_data = mem[byte_addr] (sync read);
splice new_byte = nibble_hi
? {nibble, old[3:0]}
: {old[7:4], nibble};
drive vram_we_final=1, write_addr=byte_addr&~3,
write_data shifted to byte_addr[1:0] lane,
write_be one-hot to that lane.
posedge → cycle N+2: mem[byte_addr] commits new_byte.
```
`old_byte` is sourced from the lane-correct slice of
`vram_read2_data`. CT32/CT16/T8 emits skip the pipe entirely and
fall through `vram_norm` same-cycle (CT32 stays a passthrough,
existing TBs unaffected).
**Forwarding hazard — back-to-back same-byte writes**: a PSMT4
SPRITE rasters adjacent pixels at `x=2k` and `x=2k+1` to the
SAME `byte_addr` (low + high nibble of a single byte). At cycle
N+1 the wrapper reads `mem[byte_addr]` for emit-2 in the SAME
posedge that emit-1's write commits. NBA semantics inside
`vram_bram_stub` (separate `always_ff` blocks for the write port
and the read port) make the read see the PRE-write value, so
emit-2 would splice against stale data. The Ch157 pipe carries
a 1-deep `t4_prev_*` register set (addr + new_byte from the
just-completed RMW) and forwards `t4_prev_new_byte_q` whenever
the in-flight emit's `byte_addr` matches the previous emit's
`byte_addr`. The forwarding chain extends across any number of
back-to-back same-byte emits — emit-N reads emit-(N-1)'s
`new_byte` from the forward register, splices on top, and
emit-(N+1) reads emit-N's new_byte from that same register.
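
The hazard and the forwarding fix can be reproduced in a few lines of Python. This is a behavioural sketch, not the wrapper RTL: it models memory as a byte dictionary, assumes one T4 emit per cycle with no gaps, and uses hypothetical names (`run_t4_pipe`, `splice`). The `prev` tuple plays the role of the `t4_prev_*` registers.

```python
def splice(old, nib, hi):
    """Replace the high or low nibble of `old` with `nib`."""
    return ((nib & 0xF) << 4) | (old & 0x0F) if hi else (old & 0xF0) | (nib & 0xF)

def run_t4_pipe(mem, emits, forward=True):
    """mem: {byte_addr: byte}. emits: [(byte_addr, nibble_hi, nibble)], one per cycle.
    Returns the final memory after all read-modify-writes have committed."""
    mem = dict(mem)
    stage = None  # emit whose registered read data is available this cycle
    prev = None   # (byte_addr, new_byte) of the most recently completed RMW
    for emit in list(emits) + [None]:      # one extra cycle to drain the pipe
        write = None
        if stage is not None:
            addr, hi, nib, read_byte = stage
            # Forward the previous result when it targets the same byte;
            # otherwise trust the byte that came back from the RAM.
            old = prev[1] if (forward and prev and prev[0] == addr) else read_byte
            new = splice(old, nib, hi)
            write = (addr, new)
            prev = (addr, new)
        # Clock edge: the registered read samples PRE-write memory ...
        stage = (emit[0], emit[1], emit[2], mem.get(emit[0], 0)) if emit else None
        # ... and only then does this cycle's write commit.
        if write:
            mem[write[0]] = write[1]
    return mem
```

With `mem = {0x10: 0xEF}` and two back-to-back emits to byte `0x10` (low nibble `0xA`, then high nibble `0x5`), the forwarded run ends with `0x5A`; with `forward=False` the second splice uses the stale `0xEF` and the result is `0x5F`, losing the first nibble.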
| File / target | Role |
| ---------------------------------------------------------- | ------------------------------------------------------------------------------------------------------ |
| `rtl/top/top_psmct32_raster_demo_bram.sv` | Ch156 hard-gate replaced by the RMW pipe + forwarding registers; `vram_read2_addr` driven on T4 emit cycles; `vram_we_final` mux selects T4 pipe write or non-T4 same-cycle path. |
| `tb_gs_raster_bram_psmt4` | New positive-proof TB — drives a 16×4 LINEAR PSMT4 SPRITE (PSMT4_SWIZZLE=0 so adjacent x's hit the same byte) split into two halves with distinct nibbles (0xA / 0x5). 64 raster emits; verifies every byte under the sprite holds the expected pair of spliced nibbles (left half = 0xAA, right half = 0x55) plus sentinel preserved on bytes outside the sprite. **PASS**. |
| `tb_gs_raster_bram_psmt4_gate` | Retired — the gate it asserted no longer exists. |
**Why LINEAR PSMT4 in the new TB**: the linear address formula
`(y*FBW*32) + (x>>1)` puts adjacent x's at the same byte, which
is exactly the back-to-back same-byte forwarding hazard. The
swizzled path scatters bytes via `columnTable4`, so it touches
the forwarding logic less often. Linear coverage is strictly
stronger here.
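
As a worked example of the linear formula, the Python sketch below computes the byte address and nibble select per pixel. It assumes FBP = 0 (no base offset) and that odd `x` selects the high nibble, which matches the "low + high nibble of a single byte" description for `x=2k` / `x=2k+1`; the function name is hypothetical.

```python
def psmt4_linear(x, y, fbw):
    """Byte address and nibble select for a linear (unswizzled) PSMT4 pixel.

    fbw is in units of 64 pixels, i.e. 32 bytes per unit at 4 bits per pixel.
    Assumes a frame base of 0; odd x -> high nibble is an assumption.
    """
    byte_addr = y * fbw * 32 + (x >> 1)
    nibble_hi = bool(x & 1)
    return byte_addr, nibble_hi

# Every even/odd x pair in a 16x4 sprite shares one byte, so a left-to-right
# raster produces a same-byte back-to-back pair for every byte it touches.
touched = {psmt4_linear(x, y, 1)[0] for y in range(4) for x in range(16)}
```

For a 16×4 sprite at FBW=1 this yields 32 distinct bytes for 64 emits.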
**Non-T4 TB cleanup**: `tb_gs_raster_bram_psmct16` and
`tb_gs_raster_bram_psmt8` still mirror the *non-T4* portion of
the wrapper-site plumbing, but they no longer carry the Ch156
PSMT4 hard-gate (now removed in the wrapper). Both wire
`raster_pixel_emit` straight to `write_en` and let
`vram_norm` drive addr/data/be — focused TBs verifying their
own PSM lane. Full pipe coverage lives in `tb_gs_raster_bram_psmt4`
and the top wrapper TB.
**Sim regression**: 141 PASS / 0 FAIL after Ch157 (count unchanged:
new `tb_gs_raster_bram_psmt4` added, `tb_gs_raster_bram_psmt4_gate` retired).
### PCRTC sync-read alignment (Ch158)
Ch158 closes the last big blocker before swapping the board top
to `vram_bram_stub`: the PCRTC's data-decode + sync-output
pipeline is now aware that `vram_bram_stub`'s `read_data` is
registered with 1-cycle latency, so the captured scanout no
longer trails the address stage by one column.
**`gs_pcrtc_stub` change** (in
[`rtl/gif_gs/gs_pcrtc_stub.sv`](../../rtl/gif_gs/gs_pcrtc_stub.sv)):
new module parameter `VRAM_SYNC_READ` (default 0). When set to 1,
every hcnt/vcnt-derived signal that the data-decode comb consumes
is run through a 1-cycle register before the consumer sees it
(`active_h_dec`, `active_v_dec`, `in_hsync_dec`, `in_vsync_dec`,
`in_display_window_dec`, `scanout_enable_dec`, `dispfb_psm_*_dec`,
`psm4_nibble_select_dec`, `end_of_frame_dec`). The address-side
(`vram_read_addr`) keeps using the current `(hcnt, vcnt)` so the
read is issued one pixel "ahead"; the registered `vram_read_data`
arrives one cycle later, paired with the matching delayed counter
view. Outputs `r/g/b/hsync/vsync/de` come from the `_dec` signals,
so the entire output stream shifts right by exactly one clock
when `VRAM_SYNC_READ=1`. Default `VRAM_SYNC_READ=0` is a pure
passthrough — every existing PCRTC TB written against the legacy
`vram_stub` (comb-read) shape is unaffected.
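
The alignment problem and its fix can be modelled in Python as follows. This is a sketch with hypothetical names (`scanout`, `sync_aware`), reduced to a single scanline with one word per column; it is not the `gs_pcrtc_stub` logic itself. The address side always uses the current counter, and `sync_aware` chooses whether the decode side pairs the returned data with the delayed or the current counter.

```python
def scanout(mem, width, sync_aware):
    """Return {column: pixel} for one line read through a 1-cycle registered RAM."""
    out = {}
    data_q = None  # registered read data: valid one cycle after its address
    hcnt_q = None  # 1-cycle-delayed copy of the column counter
    for hcnt in range(width + 1):          # one extra cycle drains the last read
        if data_q is not None:
            # Decode stage: pair the data with the counter view it belongs to.
            col = hcnt_q if sync_aware else hcnt
            if col < width:
                out[col] = data_q
        # Clock edge: the RAM registers mem[hcnt]; the counter copy follows.
        if hcnt < width:
            data_q, hcnt_q = mem[hcnt], hcnt
        else:
            data_q = None
    return out
```

With `mem = [10, 11, 12, 13]`, the sync-aware decode returns every pixel in its own column, while the unaware decode shifts the image right by one column and never fills column 0.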
**`top_psmct32_raster_demo_bram` change**: instantiates
`gs_pcrtc_stub` with `.VRAM_SYNC_READ(1'b1)`. The wrapper banner
updates to drop the Ch155 caveat about scanout being 1 column
shifted — that caveat is now resolved.
**`tb_top_psmct32_raster_demo_bram` extension**: adds a Phase 2
frame-capture block that arms on the next vsync rising edge
after raster drain, captures one full frame's r/g/b into
`cap_*[v][h]` indexed by a 1-cycle-delayed copy of PCRTC's
address-stage counters (since the registered `de` aligns with
those delayed counters), and asserts each captured pixel's
post-decode r/g/b matches the expected ABGR for its quadrant.
Phase 1 (per-pixel VRAM probe via hierarchical `mem[byte_addr >> 2]`)
is unchanged. **PASS** — 16×8 active region, all 128 pixels
captured + all 128 VRAM words probe-verified, `frame_seen`
latched.
**Open Ch159+ items**:
- xfer-side T4 coverage TB — the Ch157 wrapper handles xfer-side
T4 emits identically (the mux feeds `vram_psm_pre` from
`bitbltbuf_q[61:56]` during `xfer_busy`), but no focused TB
exercises that path yet.
- Swap the Ch146 board top to instantiate `vram_bram_stub` and
the Ch158 PCRTC-sync mode directly (or retire `vram_stub`
outright). All four writer PSMs and PCRTC scanout are now
proven correct against the BRAM-friendly contract; the
remaining work is the integration commit on the board side.
- Audit `useg_shadow_mem` for the same BRAM-shape forensics that
Ch153 ran on `vram_stub` (Ch64/Ch65/Ch70 mirror writes may
make it multi-port-write-shaped).
**Ch158 audit Medium fix — sub-word PSM lane selection**: the
initial Ch158 cut shifted the data-decode pipeline by 1 cycle
to align with `vram_bram_stub`'s registered output, but it
still extracted CT16 / PSMT8 / PSMT4 sub-word values from the
LOW lane of `vram_read_data` (i.e. `[15:0]` halfword and
`[7:0]` byte). That worked for `vram_stub` (byte-addressable;
the read returns 4 bytes starting at `byte_addr` so the
sub-word always lands at the low lane) but NOT for
`vram_bram_stub` (word-addressable; `read_data` is
`mem[byte_addr >> 2]` so the sub-word lives at lane
`byte_addr[1:0]` of the returned word). Codex Ch158 audit
called this out as a blocker for any sub-word PSM scanout
through the BRAM. The fix adds:
- `vram_addr_lane_q` — 1-cycle-delayed copy of
`vram_read_addr[1:0]`, paralleling the other `_q` decode-
stage registers added in the original Ch158 cut.
- `data_lane = VRAM_SYNC_READ ? vram_addr_lane_dec : 2'd0` —
forces the legacy comb-read path to keep using the low lane
(preserving every existing PCRTC TB's expectation), and
resolves to the correct byte_addr-keyed lane in sync mode.
- `psm16_pixel = data_lane[1] ? read_data[31:16] : read_data[15:0]`.
- A `vram_byte_lane` mux extracting one of 4 byte lanes for
PSMT8 (`psm8_idx`) and PSMT4 (`psm4_byte_lane` → nibble
splice).
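
The lane selection described above amounts to the following. This is a Python sketch with a hypothetical function name; `sync_read` stands in for the `VRAM_SYNC_READ` parameter, and in the legacy case the model just reproduces the "always low lane" behaviour.

```python
def select_lanes(read_data, byte_addr, sync_read):
    """Return (psm16_pixel, byte_lane) extracted from a 32-bit read word."""
    # Sync (word-addressed) reads key the lane on byte_addr[1:0];
    # legacy comb reads always use the low lane.
    lane = (byte_addr & 3) if sync_read else 0
    psm16_pixel = (read_data >> 16) & 0xFFFF if lane & 2 else read_data & 0xFFFF
    byte_lane = (read_data >> (8 * lane)) & 0xFF
    return psm16_pixel, byte_lane
```

For the word `0x9F8E6155`, a sync-mode read at `byte_addr[1:0] = 2` yields halfword `0x9F8E` and byte `0x8E`, whereas the low-lane-only decode would have returned `0x6155` and `0x55`.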
Two new focused integration TBs prove the fix end-to-end with
adversarial pre-loads:
| TB | Coverage |
| --------------------------------------------------------------------------------- | -------------------------------------------------------------------------------------------------------------- |
| [`tb_gs_scanout_bram_psmct16`](../../sim/tb/gif_gs/tb_gs_scanout_bram_psmct16.sv) | 4-pixel CT16 scanout reading mem[0]/mem[1] with FOUR distinct halfwords across both halfword lanes (`byte_addr[1]∈{0,1}`); each pixel's captured 5→8-decoded RGB matches the expected halfword. **PASS** |
| [`tb_gs_scanout_bram_psmt8`](../../sim/tb/gif_gs/tb_gs_scanout_bram_psmt8.sv) | 4-pixel PSMT8 scanout reading mem[0] with FOUR distinct byte indices, one per byte lane (`byte_addr[1:0] ∈ {0,1,2,3}`); each pixel's grayscale RGB matches the expected byte. **PASS** |
Without the fix, both TBs would have failed: the CT16 TB would
emit each word's low halfword twice (never the high halfword),
and the PSMT8 TB would emit `IDX_0` for all four pixels.
**Sim regression**: 143 PASS / 0 FAIL after Ch158 audit fixes
(141 + 2 new BRAM scanout TBs).
### Board-top swap to BRAM wrapper + Quartus fit recovery (Ch159)
Ch159 commits the integration step that the prior chapters
were building toward: the DE25-Nano board top
([`rtl/top/de25_nano_psmct32_raster_demo_top.sv`](../../rtl/top/de25_nano_psmct32_raster_demo_top.sv))
now instantiates [`top_psmct32_raster_demo_bram`](../../rtl/top/top_psmct32_raster_demo_bram.sv)
instead of the legacy [`top_psmct32_raster_demo`](../../rtl/top/top_psmct32_raster_demo.sv).
External port shape is identical so this is drop-in at the
board level; the BRAM-backed wrapper carries through every
Ch155-Ch158 fix (writer-side normalize + PSMT4 RMW pipe +
PCRTC sync-read alignment + sub-word lane select). The synth
file list ([`synth/de25_nano/top_psmct32_raster_demo/files.f`](../../synth/de25_nano/top_psmct32_raster_demo/files.f))
and Quartus QSF gain `vram_normalize_pkg.sv`, `vram_bram_stub.sv`,
and `top_psmct32_raster_demo_bram.sv`; the legacy `vram_stub`
+ legacy top stay on the project for back-compat with sim TBs
that still target them.
**Quartus fit recovery — vs Ch152 baseline**: the headline of
this chapter. Ch152 fit FAILED at 155k ALMs needed (331% of the device's 46,800)
because `vram_stub`'s 8 KiB byte-addressable + per-bit-RMW
storage didn't infer as M20K and landed as a 65,536-flip-flop
array, dragging 121k registers and 199k synthesis ALMs along
with it. Ch159 swap turns those numbers around:
| Metric | Ch152 (vram_stub) | Ch159 (vram_bram_stub) | Δ |
| ---------------------------------- | ---------------------------- | ----------------------------- | ----------------------- |
| Synthesis status | Successful | Successful | — |
| Synthesis ALMs estimate            | 199,103 / 46,800 (425%)      | **22,704 / 46,800 (49%)**     | -176,399 (**-88.6%**)   |
| Synthesis registers                | 101,457                      | **36,008**                    | -65,449 (**-64.5%**)    |
| **Fit status**                     | **FAILED** (155k / 331%)     | **Successful** (30,364 / 65%) | ✅ **fits**             |
| Fit registers                      | 121,176                      | **39,085**                    | -82,091 (**-67.7%**)    |
| Fit RAM blocks                     | 6 / 358                      | **14 / 358**                  | +8 (BRAM-inferred VRAM) |
| Fit block memory bits              | 65,536                       | **196,608**                   | +131,072 (data in M20K) |
| Fit DSP blocks                     | 20                           | 18                            | -2                      |
| **STA status**                     | **DID NOT RUN** (fit failed) | **Successful** (12 warnings)  | ✅ STA reachable        |
| STA setup slack worst (CLOCK2_50)  | n/a                          | -6.950 ns                     | timing miss at 50 MHz   |
| Fmax | n/a | 37.11 MHz | (Ch160+ tunes) |
The eight new RAM blocks are the same `vram_bram_stub`
footprint exp_c proved in Ch154 (8 RAM blocks for the dual-port
+ admission-gated 8 KiB shape; the +6 already in the Ch152
baseline came from `bios_rom_stub` + `ee_ram_stub` +
`useg_shadow_mem` correctly inferring as BRAM there). The
register drop (121k → 39k) is essentially the entire VRAM
flip-flop array vanishing.
**Setup-slack reality check**: STA reports -6.950 ns slack at
the CLOCK2_50 50 MHz constraint (Fmax = 37.11 MHz). The
critical path is somewhere in the Ch123 dep tree's longer
combinational chains (likely the Gouraud divider or one of
the swizzle muxers). That is **NOT** a Ch159 regression — it's
a brand-new visibility unlocked by being able to run STA at
all. Ch160+ owns timing closure (PLL down-clock to ≤30 MHz,
critical-path pipelining, or both).
**Snapshots preserved**: the Ch152 baseline reports are saved
under
[`synth/de25_nano/top_psmct32_raster_demo/baseline_ch152/`](../../synth/de25_nano/top_psmct32_raster_demo/baseline_ch152/)
(syn / fit summaries + flow.rpt + parse_report.txt) so future
chapters can diff against them without re-running the failing
Ch152 baseline.
**Sim regression**: 143 PASS / 0 FAIL unchanged. The Ch149
board-wrapper TB exercises the same external behavior with
the new core wrapper inside.
### Down-clock target + first .sof bitstream (Ch160)
Ch160 closes the loop Codex framed at the end of Ch159 — "first
add a down-clock PLL profile so we can get a real bitstream
moving on hardware, then use the successful STA path report to
decide whether to pipeline toward 50 MHz." The chapter is SDC-
and build-flow-only; no RTL changes.
**SDC retarget** ([`synth/de25_nano/top_psmct32_raster_demo/de25_nano_psmct32_raster_demo_top.sdc`](../../synth/de25_nano/top_psmct32_raster_demo/de25_nano_psmct32_raster_demo_top.sdc))
relaxes the CLOCK2_50 period from 20.000 ns (50 MHz) to
33.333 ns (30 MHz). The DE25-Nano's CLOCK2_50 oscillator is
physically still 50 MHz; the SDC tells Quartus to ASSUME a
30 MHz input so the fitter closes timing at the down-clock
target. A real PLL `.ip` that divides 50 → 30 MHz on hardware
is the Ch161+ commit (the QSF's commented-out `QIP_FILE`
swap-point is staged for it). Until then, the .sof produced
under this constraint is structurally clean for 30 MHz
operation; programming it on a board where CLOCK2_50 is still
wired straight through gives an effective 50 MHz chip clock
that may show setup-violating behavior — Ch161 closes that
gap.
**`build_quartus.sh` adds `quartus_asm`** ([`synth/de25_nano/top_psmct32_raster_demo/build_quartus.sh`](../../synth/de25_nano/top_psmct32_raster_demo/build_quartus.sh))
gated on a clean STA, so a `.sof` bitstream is now produced
when the design fits and timing closes. The Make scaffold
check is loosened to accept either the 20.000 ns (legacy 50 MHz)
or the 33.333 ns (Ch160 down-clock) period.
**Quartus result vs Ch159**:
| Metric | Ch159 (50 MHz target) | Ch160 (30 MHz target) |
|-------------------------------|-------------------------------|-------------------------------|
| Synth ALMs estimate | 22,704 / 46,800 (49 %) | 22,704 / 46,800 (49 %) |
| Synth registers | 36,008 | 36,008 |
| Fit status | Successful | Successful |
| Fit ALMs | 30,364 / 46,800 (65 %) | 31,056 / 46,800 (66 %) |
| Fit registers | 39,085 | 37,381 |
| Fit RAM blocks | 14 / 358 | 14 / 358 |
| **STA setup slack worst**     | **-6.950 ns** (timing miss)   | **+0.805 ns** (closes)        |
| **Fmax (CLOCK2_50)** | 37.11 MHz | 30.74 MHz |
| **`quartus_asm`** | (skipped) | **Successful — `.sof` produced** |
The synth-side numbers are identical because no RTL changed —
the differences are entirely in the fitter's placement choices
under the looser timing constraint. Reported Fmax dropped
(37.11 → 30.74 MHz) because Quartus optimizes harder when the
target is tighter; the headline is that **at the 30 MHz target
the design CLOSES** (positive slack on every report) and a
real `.sof` is now generated.
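
As a sanity check on the table, the reported slack and Fmax figures can be cross-checked against each other. The Python sketch below assumes a single-clock design where worst setup slack is approximately the constraint period minus the minimum achievable period (1000 / Fmax in ns); it ignores rounding in the reported Fmax, so agreement is only expected to within a few picoseconds. The function name is hypothetical.

```python
def slack_ns(target_period_ns, fmax_mhz):
    """Approximate worst setup slack from the constraint period and Fmax."""
    return target_period_ns - 1000.0 / fmax_mhz

ch159 = slack_ns(20.000, 37.11)   # 50 MHz constraint: negative, timing miss
ch160 = slack_ns(33.333, 30.74)   # 30 MHz constraint: positive, timing closes
```

Both values land within about 0.005 ns of the reported magnitudes (6.950 ns and 0.805 ns), and the Ch159 value comes out negative, consistent with a timing miss at 50 MHz.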
**Critical path** (from
[`output_files/de25_nano_psmct32_raster_demo_top.sta.rpt`](../../synth/de25_nano/top_psmct32_raster_demo/output_files/de25_nano_psmct32_raster_demo_top.sta.rpt),
worst-10 paths all in the same module hierarchy):
| Field | Value |
|--------------|------------------------------------------------------------------------------------------|
| Slack | +0.805 ns (worst path of 10 with this slack value) |
| From / To | `u_demo|u_core|div_0_rtl_0|auto_generated|divider|divider|...` (intra-divider register-to-register) |
| Data Delay | 32.516 ns (out of 33.333 ns period) |
| Critical net | The EE core's auto-generated 64-bit signed divider (the Ch152-noted Gouraud TRI divider — dead code in the PSMCT32 raster demo because no `RM_TRI` primitive is dispatched). |
**Ch161+ pipelining handoff**: the path Codex's framing asked
us to surface is now visible. Two options:
1. **Pipeline the divider** — re-implement `ee_core`'s 64-bit
division as an N-cycle multi-cycle path. Quartus's auto-
generated divider is a single-cycle ripple chain; making it
2-3 stage pipelined would put Fmax comfortably above 50 MHz.
2. **Strip it from the build** — gate the Gouraud TRI
divider behind a `STRIP_GOURAUD_TRI` parameter (default
off), so the PSMCT32 raster demo's hardware build instances
the EE core without it. Quartus removes the entire
`div_0_rtl_0` block; Fmax should jump dramatically.
Option 2 is the lower-blast-radius hardware bring-up move
(removes ~32 ns of dead-code combinational chain); option 1
is the long-term correct fix once the Gouraud TRI path goes
load-bearing.
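
To make option 1 concrete, here is a behavioral model of a multi-cycle unsigned divider. It is a sketch only: it uses the textbook restoring shift-subtract scheme at one quotient bit per modeled cycle (32 cycles for 32 bits), which is a different trade-off from the 2-3 stage pipeline mentioned above. `divu_multicycle` is a hypothetical name, not anything that exists in `ee_core`.

```python
from typing import Tuple


def divu_multicycle(dividend: int, divisor: int, width: int = 32) -> Tuple[int, int, int]:
    """Unsigned restoring division, one quotient bit per modeled clock cycle.

    Returns (quotient, remainder, cycles). Each loop iteration stands for one
    clock: the only combinational work per cycle is a shift, a compare and a
    subtract, instead of the whole division in a single cycle.
    """
    mask = (1 << width) - 1
    dividend &= mask
    divisor &= mask
    if divisor == 0:
        raise ValueError("divisor must be non-zero")

    quotient = 0
    remainder = 0
    cycles = 0
    for bit in range(width - 1, -1, -1):
        # Shift the next dividend bit into the partial remainder.
        remainder = (remainder << 1) | ((dividend >> bit) & 1)
        # Trial subtract: keep the result only if it does not go negative.
        if remainder >= divisor:
            remainder -= divisor
            quotient |= 1 << bit
        cycles += 1
    return quotient, remainder, cycles


q, r, n = divu_multicycle(100, 7)
assert (q, r, n) == (14, 2, 32)
```

The cost is latency: the core would have to stall or interlock for `width` cycles per divide, which is why this only becomes worthwhile once the divider is actually exercised.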
**Snapshots**: Ch159 baseline reports preserved under
[`baseline_ch159/`](../../synth/de25_nano/top_psmct32_raster_demo/baseline_ch159/)
(syn / fit / sta summaries + parse_report).
**Sim regression**: 143 PASS / 0 FAIL unchanged (no RTL
changes). Scaffold check + Ch149 board TB + top BRAM TB all
green under the new SDC.
### Real PLL IP commit — `.sof` actually runs at 30 MHz (Ch161)
Ch161 retires the Ch160 hardware-honesty caveat by committing a
real Quartus IOPLL `.ip` configured for 50 MHz refclk → 30 MHz
outclk_0. The wrapper's `` `ifdef USE_PLL_IP `` (staged in Ch151)
now flips to the IP-generated `pll` module on Quartus builds;
sim TBs continue to use the pass-through `de25_nano_pll_stub`.
**Files committed under
[`synth/de25_nano/top_psmct32_raster_demo/ip/`](../../synth/de25_nano/top_psmct32_raster_demo/ip/)**:
- `pll.ip` — adapted from `retroDE_nes/ip/audio_pll.ip` (single-
output Agilex 5 IOPLL template), retargeted to 50 MHz refclk
→ 30 MHz outclk_0.
- `pll/pll.qip` + `pll/synth/pll.v` + `pll/pll_bb.v` — Quartus
IP-generated artifacts (`quartus_ipgenerate de25_nano_psmct32_raster_demo_top --ip_file=ip/pll.ip --generate_ip_file --synthesis=verilog`).
The generated `pll` module exposes
`(refclk, rst, outclk_0, locked)` — exactly the Ch151 stub's
signature, so the `` `ifdef `` swap is drop-in.
**Wiring changes**:
- `de25_nano_psmct32_raster_demo_top.qsf` uncommented the
`set_global_assignment -name QIP_FILE ip/pll/pll.qip` swap-
point and added
`set_global_assignment -name VERILOG_MACRO "USE_PLL_IP=1"` so
Quartus instantiates the IP `pll` instead of the
`de25_nano_pll_stub`.
- `de25_nano_psmct32_raster_demo_top.sdc` reverted the Ch160
CLOCK2_50 period back to 20.000 ns (the physical 50 MHz
oscillator). The IOPLL's auto-generated SDC inside the .qip
declares the post-PLL `outclk_0` clock at 30 MHz, so STA
picks up two domains: `u_pll|iopll_0_refclk` (50 MHz, the
pin) and `u_pll|iopll_0_outclk0` (30 MHz, the design clock).
- `build_quartus.sh` symlinks the `ip/` dir alongside the
existing `rtl/` and `sim/` symlinks so the QIP_FILE's
`ip/pll/pll.qip` path resolves from the work dir.
**Quartus result vs Ch160**:
| Metric | Ch160 (SDC profile only) | Ch161 (real PLL IP) |
|-----------------------------------|------------------------------|------------------------------|
| Fit ALMs | 31,056 / 46,800 (66 %) | 30,898 / 46,800 (66 %) |
| Fit registers | 37,381 | 37,352 |
| **Fit PLLs** | **0 / 11** | **1 / 11** (real IOPLL) |
| RAM blocks | 14 / 358 | 14 / 358 |
| Setup slack worst (design_clk) | +0.805 ns @ CLOCK2_50 | **+0.565 ns @ u_pll|iopll_0_outclk0** |
| Fmax (design_clk) | 30.74 MHz | **30.74 MHz** |
| `quartus_asm` | Successful | Successful (`.sof` produced) |
The `+1` PLL block is the real IOPLL on the chip; ALMs go down
slightly because the stub's clock-distribution path no longer
needs ALM glue. STA now reports BOTH clock domains: the refclk
(50 MHz, +19.249 ns slack — trivially fast) and the design_clk
(30 MHz post-PLL, +0.565 ns slack: thin but positive). The
`.sof` produced under this configuration **genuinely runs at
30 MHz on the DE25-Nano**: the IOPLL takes the 50 MHz CLOCK2_50
input and synthesizes 30 MHz inside the chip, so the entire
design downstream of `u_pll.outclk_0` operates at the
constrained frequency. (Setup slack landed at +0.914 ns on the
initial Ch161 build; the Ch161 audit's wider reset false-path
nudged the fitter into a slightly different placement, dropping
the worst-case setup slack to +0.565 ns. Recovery analysis on
the rst_sync stages — which had been hiding a real -0.079 ns
violation under the original `*rst_sync[0]` constraint — is now
gone from the .sta.summary entirely after the false-path was
widened to `*rst_sync[*]`.)
**Snapshots**: Ch160 baseline (parse_report + summaries +
`.sof`) preserved under
[`baseline_ch160/`](../../synth/de25_nano/top_psmct32_raster_demo/baseline_ch160/).
**Open Ch162+ items** (Ch161 forward-ref, **superseded by
Ch162 below**):
- ~~Pipeline or strip the EE-core 64-bit Gouraud TRI divider~~ —
**closed in Ch162** via `STRIP_HW_DIVIDER` (note: the actual
divider is the Ch43 DIVU divider, not Gouraud TRI; the
forward-ref's name was loose). The Ch162 strip retired the
`u_demo|u_core|div_0_rtl_0|...` STA worst path entirely; see
the Ch162 section below for the new critical path.
- xfer-side T4 coverage TB (open from Ch157+).
- `useg_shadow_mem` BRAM-shape forensics.
- Video PHY shim (HDMI / VGA / PMOD) — `VIDEO_*` pins
virtualized.
**Sim regression**: 143 PASS / 0 FAIL unchanged. Sim ignores
the `` `ifdef USE_PLL_IP `` (no `+define+USE_PLL_IP` in the
iverilog Makefile) so the stub stays active under sim.
### Strip the EE-core hardware divider (Ch162)
Ch162 takes the lower-blast-radius move from the Ch161 STA handoff:
add a parameter that gates the EE-core's auto-inferred 32-bit
hardware divider out of synthesis on the PSMCT32 SPRITE-only
hardware build, then re-measure Fmax.
**RTL change** ([rtl/ee/ee_core_stub.sv](../../rtl/ee/ee_core_stub.sv))
gains `parameter bit STRIP_HW_DIVIDER = 1'b0`. Two `/` and `%`
sites tied to the Ch43 DIVU instruction are gated by this
parameter — the writeback (lines ~932-935) and the retire-
trace `arg3` mirror (lines ~1005-1014). Default `0` keeps
DIVU semantics intact for every existing sim TB
(`tb_ee_core_divu_mflo` is the only consumer; it stays at the
default). When the parameter is `1`, the writeback becomes a
no-op (HI/LO unchanged, identical to the divisor==0 case the
spec calls undefined) and the retire-trace `arg3` reports 0.
Quartus then has nothing to infer — the `div_0_rtl_0` block
disappears.
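
The writeback behavior described above can be modeled in a few lines. This is a behavioral sketch of the prose, not a transcription of `ee_core_stub.sv`: `divu_writeback` and its argument names are hypothetical, and it assumes the usual MIPS convention that LO receives the quotient and HI the remainder.

```python
from typing import Tuple

MASK32 = 0xFFFF_FFFF


def divu_writeback(hi: int, lo: int, rs: int, rt: int,
                   strip_hw_divider: bool = False) -> Tuple[int, int]:
    """Return (HI, LO) after a DIVU retires.

    With the strip enabled, or with a zero divisor, HI/LO are left
    unchanged, which is the no-op behavior the text describes.
    """
    rs &= MASK32
    rt &= MASK32
    if strip_hw_divider or rt == 0:
        return hi, lo              # no-op writeback
    return rs % rt, rs // rt       # HI = remainder, LO = quotient


# Default build: DIVU semantics intact.
assert divu_writeback(0xAA, 0xBB, 100, 7) == (2, 14)
# Stripped build: same instruction leaves HI/LO untouched.
assert divu_writeback(0xAA, 0xBB, 100, 7, strip_hw_divider=True) == (0xAA, 0xBB)
```

The stripped path is only behavior-neutral for code that never reads HI/LO after a DIVU, which is the property the demo bootlet is relied on to have.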
**Wrapper plumbing**:
[`top_psmct32_raster_demo_bram`](../../rtl/top/top_psmct32_raster_demo_bram.sv)
gains a matching `STRIP_HW_DIVIDER` parameter and forwards it
to `ee_core_stub`. The
[DE25-Nano board top](../../rtl/top/de25_nano_psmct32_raster_demo_top.sv)
sets `.STRIP_HW_DIVIDER(1'b1)` on its `u_demo` instantiation
(the bootlet doesn't execute DIVU, so this is behavior-neutral
for the demo). Sim TBs that instantiate the BRAM wrapper
directly use the default 0.
**Quartus result vs Ch161 (real-PLL baseline)**:
| Metric | Ch161 (real PLL) | Ch162 (real PLL + strip) |
|-----------------------------------|-------------------------------|-------------------------------|
| Fit ALMs | 30,898 / 46,800 (66 %) | 30,006 / 46,800 (64 %) |
| Fit registers | 37,352 | 36,618 |
| Fit PLLs | 1 | 1 |
| RAM blocks | 14 | 14 |
| **Setup slack worst (design)** | +0.565 ns | **+3.567 ns** |
| **Fmax (design domain)** | 30.74 MHz | **33.6 MHz** (+9.4 %) |
| `quartus_asm` | Successful | Successful (`.sof` produced) |
Stripping the divider freed 892 ALMs / 734 registers and
yielded ~3 ns of new setup margin. **Fmax climbs from 30.74
MHz to 33.6 MHz** — a real jump, but **not enough to clear the
50 MHz target** (about +49 % above 33.6 MHz, or +67 % above the 30 MHz design clock). Codex's
Ch162 framing predicted this branch: "if Fmax jumps, we have a
clean path to a 50 MHz demo bitstream; if not, the next real
critical path will reveal itself." We landed in the second
branch — Fmax jumped, but not far enough.
**New critical path** (the Ch163+ handoff, from
[`output_files/de25_nano_psmct32_raster_demo_top.sta.rpt`](../../synth/de25_nano/top_psmct32_raster_demo/output_files/de25_nano_psmct32_raster_demo_top.sta.rpt)):
| Field | Value |
|------------|-------------------------------------------------------------------------------------------------------------------|
| Slack | +3.567 ns |
| From | `u_demo|u_pcrtc|div_1_rtl_0|auto_generated|divider|divider|...` (PCRTC magnification divider) |
| To | `u_demo|u_vram|mem_rtl_0|auto_generated|altera_syncram_impl1|ram_block2a15~reg0` (VRAM port input) |
| Data delay | 38.443 ns of arrival vs 42.010 ns required (period 33.333 ns + clock skew + uncertainty) |
The PCRTC divider comes from
[`gs_pcrtc_stub.sv`](../../rtl/gif_gs/gs_pcrtc_stub.sv) lines:
```
assign vram_x_unshift = {20'd0, hwin_rel} / hmag_factor;
assign vram_y_unshift = {20'd0, vwin_rel} / vmag_factor;
```
where `hmag_factor = MAGH + 1` and `vmag_factor = MAGV + 1`.
For the demo `MAGH = MAGV = 0`, so the divisor is constant 1
— but Quartus doesn't constant-propagate through this
formulation and synthesizes a real 32-bit divider anyway. The
fix that parallels Ch162 would be a `STRIP_PCRTC_MAG_DIV`
parameter (or a more general "demo doesn't use magnification"
hint that bypasses the divider when both MAGH and MAGV are
constant 0).
**Snapshots**: Ch161 baseline preserved under
[`baseline_ch161/`](../../synth/de25_nano/top_psmct32_raster_demo/baseline_ch161/)
(syn / fit / sta summaries + parse_report + .sof) for diff.
**Open Ch163+ items**:
- Strip the PCRTC magnification divider on hardware builds
(next critical path; same shape as Ch162's
`STRIP_HW_DIVIDER`).
- Once Fmax climbs north of 50 MHz, retune the IOPLL `.ip` to
outclk_0 = 50 MHz, retarget the SDC, and ship a 50 MHz
bitstream.
- xfer-side T4 coverage TB (still open from Ch157+).
- `useg_shadow_mem` BRAM-shape forensics.
- Video PHY shim (HDMI / VGA / PMOD) — `VIDEO_*` pins
virtualized.
**Sim regression**: 143 PASS / 0 FAIL unchanged. Default
`STRIP_HW_DIVIDER=0` preserves DIVU semantics for
`tb_ee_core_divu_mflo`; the board top's `STRIP_HW_DIVIDER=1`
goes through `tb_de25_nano_psmct32_raster_demo_top` cleanly
because the Ch149 board TB doesn't exercise DIVU.
### Strip PCRTC magnification divider + 50 MHz close (Ch163)
Ch163 takes the next critical-path attack from the Ch162 STA
report (the PCRTC magnification divider) and uses the resulting
Fmax headroom to retune the PLL IP to 50 MHz output — closing
the journey that started at the Ch152 fit failure with a real
50 MHz bitstream.
**RTL change** ([rtl/gif_gs/gs_pcrtc_stub.sv](../../rtl/gif_gs/gs_pcrtc_stub.sv))
gains `parameter bit STRIP_PCRTC_MAG_DIV = 1'b0`. The two `/`
operators are gated:
```
assign vram_x_unshift = STRIP_PCRTC_MAG_DIV
? {20'd0, hwin_rel}
: ({20'd0, hwin_rel} / hmag_factor);
assign vram_y_unshift = STRIP_PCRTC_MAG_DIV
? {20'd0, vwin_rel}
: ({20'd0, vwin_rel} / vmag_factor);
```
Default `0` keeps the live divider math for every Ch93-era
magnification scanout TB (`tb_gs_scanout_magh_magv` etc.). When
`1`, the math collapses to a passthrough — equivalent to the
MAGH=MAGV=0 case the demo always hits but with no inferred
divider for Quartus to synthesize.
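
The equivalence claim can be checked with a small model of the two `assign` statements. This is a sketch under assumptions: `vram_unshift` is a hypothetical helper, and the 12-bit coordinate range is inferred from the `{20'd0, ...}` zero-extension to 32 bits rather than taken from the RTL.

```python
def vram_unshift(win_rel: int, mag: int, strip_pcrtc_mag_div: bool = False) -> int:
    """Map a window-relative scanout coordinate back to a VRAM coordinate.

    mag is the raw MAGH or MAGV field; the divisor is mag + 1.
    """
    if strip_pcrtc_mag_div:
        return win_rel               # passthrough: no divider to infer
    return win_rel // (mag + 1)      # live path: truncating unsigned divide


COORD_RANGE = range(1 << 12)         # assumed 12-bit hwin_rel / vwin_rel

# MAG = 0 means divide by 1, so the stripped build matches for every coordinate.
assert all(vram_unshift(x, 0) == vram_unshift(x, 0, True) for x in COORD_RANGE)

# Any non-zero MAG breaks the equivalence, so the strip is demo-only.
assert vram_unshift(10, 1) == 5
assert vram_unshift(10, 1, strip_pcrtc_mag_div=True) == 10
```

The last two assertions are the reason the parameter defaults to off: a build with the strip enabled silently ignores MAGH/MAGV.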
**Wrapper plumbing**:
[`top_psmct32_raster_demo_bram`](../../rtl/top/top_psmct32_raster_demo_bram.sv)
gains a matching `STRIP_PCRTC_MAG_DIV` parameter that forwards
to `gs_pcrtc_stub`. The
[DE25-Nano board top](../../rtl/top/de25_nano_psmct32_raster_demo_top.sv)
sets `.STRIP_PCRTC_MAG_DIV(1'b1)` on its `u_demo` instantiation.
**Quartus result, two stages**:
*Stage A — strip @ 30 MHz target (still on the Ch161 PLL .ip)*:
| Metric | Ch162 (strip EE divider only) | Ch163 (strip both, 30 MHz) |
|-----------------------|-------------------------------|----------------------------|
| Fit ALMs | 30,006 / 46,800 (64 %) | 27,216 / 46,800 (58 %) |
| Setup slack worst | +3.567 ns | +21.113 ns |
| **Fmax (design)** | 33.6 MHz | **81.83 MHz** (+143 %) |
The Ch163 strip alone freed +17.5 ns of margin and 2,790 ALMs
— large enough to clear 50 MHz outright. Codex's Ch162 framing
predicted both branches of the if-Fmax-jumps fork; Ch163 lands
in the **first** branch ("clean path to a 50 MHz demo
bitstream").
*Stage B — retune PLL .ip from 30 MHz → 50 MHz output*:
The `pll.ip` source's `gui_output_clock_frequency0` and
`gui_output_clock_frequency_ps0` are bumped (30.0 → 50.0 MHz;
33333.333 → 20000.0 ps). `quartus_ipgenerate` rebuilds the
.qip / synth files in-place. No SDC change needed — CLOCK2_50
stays pinned at the physical 50 MHz period; the IOPLL's auto-
generated SDC declares the new outclk_0 frequency.
| Metric | Ch163 strip @ 30 MHz target | Ch163 strip @ 50 MHz target |
|-----------------------|-----------------------------|------------------------------|
| Fit ALMs | 27,216 / 46,800 (58 %) | 27,543 / 46,800 (59 %) |
| RAM blocks / PLLs | 14 / 1 | 14 / 1 |
| **Setup slack worst** | +21.113 ns | **+7.500 ns** |
| **Fmax (design)** | 81.83 MHz | **80.0 MHz** |
| `.sof` produced | yes (30 MHz run on hw) | **yes — 50 MHz on hw** |
**The .sof produced under Stage B genuinely runs at 50 MHz on
the DE25-Nano** — the IOPLL takes 50 MHz CLOCK2_50 in and
emits 50 MHz outclk_0 (effectively a 1:1 relation through the
real PLL hardware so the chip's clock distribution still goes
through the IOPLL's clock network). All 8 timing classes
positive; no recovery violations; build gate Successful.
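
The two `pll.ip` values bumped in Stage B are the same frequency expressed two ways. A minimal sketch of the conversion, assuming nothing about the `.ip` file format beyond the two parameter names quoted above (`period_ps` and `outclk0_settings` are hypothetical helpers):

```python
def period_ps(freq_mhz: float) -> float:
    """Convert a clock frequency in MHz to its period in picoseconds."""
    if freq_mhz <= 0:
        raise ValueError("frequency must be positive")
    return 1_000_000.0 / freq_mhz


def outclk0_settings(freq_mhz: float) -> dict:
    """Pair the MHz and ps values so they cannot drift apart when retuning."""
    return {
        "gui_output_clock_frequency0": freq_mhz,
        "gui_output_clock_frequency_ps0": round(period_ps(freq_mhz), 3),
    }


for target_mhz in (30.0, 50.0):
    print(outclk0_settings(target_mhz))
```

For 30.0 MHz this gives 33333.333 ps and for 50.0 MHz it gives 20000.0 ps, the before and after values of the Stage B edit.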
**Snapshots**:
- [`baseline_ch162/`](../../synth/de25_nano/top_psmct32_raster_demo/baseline_ch162/)
— Ch162 30 MHz state with EE divider stripped only.
- [`baseline_ch163_30mhz/`](../../synth/de25_nano/top_psmct32_raster_demo/baseline_ch163_30mhz/)
— Ch163 strip-both at 30 MHz target (Stage A milestone).
**Open Ch164+ items** (the project has hit the major hardware
milestone Codex called out at Ch157+; Ch164+ is post-launch):
- xfer-side T4 coverage TB (open from Ch157+).
- `useg_shadow_mem` BRAM-shape forensics.
- Video PHY shim (HDMI / VGA / PMOD) — `VIDEO_*` pins still
virtualized; this is the next big front-end deliverable
before the demo can paint a real screen.
**Sim regression**: 143 PASS / 0 FAIL unchanged. Default-off
on `STRIP_PCRTC_MAG_DIV` preserves every Ch93 magnification
scanout TB; the board top's `STRIP_PCRTC_MAG_DIV=1` propagates
cleanly through `tb_de25_nano_psmct32_raster_demo_top` since
the demo locks MAGH=MAGV=0.
### HDMI pin shim — pixels off-chip (Ch164)
Ch164 is the first video-PHY chapter — Codex's framing was "small
PHY shim chapter, not a full display-stack leap. Get pixels off-
chip before making them pretty." Replace the abstract
`VIDEO_R/G/B/HSYNC/VSYNC/DE` virtual pins with real DE25-Nano
HDMI transmitter signals; the ADV7513 chip itself stays asleep
(its I²C wake-up FSM is the Ch165+ chapter), so the bitstream
makes the FPGA pins toggle correctly but a real monitor stays
dark until Ch165 lands.
**Wrapper change** ([rtl/top/de25_nano_psmct32_raster_demo_top.sv](../../rtl/top/de25_nano_psmct32_raster_demo_top.sv)):
five new top-level outputs added — `HDMI_TX_CLK` (= `design_clk`,
the 50 MHz pixel clock), `HDMI_TX_D[23:0]` packing
`{VIDEO_R, VIDEO_G, VIDEO_B}` (R in MSBs, ADV7513 default 24-bit
RGB), and `HDMI_TX_HS / HDMI_TX_VS / HDMI_TX_DE` mirroring the
abstract VIDEO_* signals. The VIDEO_* ports are kept on the
wrapper as `VIRTUAL_PIN ON` (the Ch149 board TB references them
via hierarchical probe).
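
The data-bus packing is a plain concatenation. The sketch below models `{VIDEO_R, VIDEO_G, VIDEO_B}` with R in the top byte; it assumes 8 bits per channel (24 bits split three ways), and `pack_hdmi_tx_d` / `unpack_hdmi_tx_d` are hypothetical helper names for illustration.

```python
from typing import Tuple


def pack_hdmi_tx_d(r: int, g: int, b: int) -> int:
    """Pack 8-bit R, G, B into a 24-bit word with R in bits [23:16]."""
    for name, value in (("r", r), ("g", g), ("b", b)):
        if not 0 <= value <= 0xFF:
            raise ValueError(f"{name} out of 8-bit range: {value}")
    return (r << 16) | (g << 8) | b


def unpack_hdmi_tx_d(word: int) -> Tuple[int, int, int]:
    """Inverse of pack_hdmi_tx_d, useful when decoding a captured bus trace."""
    return (word >> 16) & 0xFF, (word >> 8) & 0xFF, word & 0xFF


assert pack_hdmi_tx_d(0x12, 0x34, 0x56) == 0x123456
assert unpack_hdmi_tx_d(0xFF0000) == (0xFF, 0x00, 0x00)   # pure red
```

A swapped channel order shows up immediately with a pure-red test pattern, since only bits [23:16] should toggle.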
**QSF change** ([synth/.../de25_nano_psmct32_raster_demo_top.qsf](../../synth/de25_nano/top_psmct32_raster_demo/de25_nano_psmct32_raster_demo_top.qsf)):
HDMI pinout sourced from
[`retroDE_nes/retroDE_nes.qsf`](../../../retroDE_nes/retroDE_nes.qsf)
for the same DE25-Nano (Terasic Agilex 5) board — `HDMI_TX_CLK`
on `PIN_DJ24` with 1.1-V IO standard (matches the on-board level
shifter), data + sync pins on 3.3-V LVCMOS. The companion
ADV7513 control pins (`HDMI_I2C_SCL`, `HDMI_I2C_SDA`,
`HDMI_TX_INT`, `HDMI_MCLK`) are intentionally NOT pinned — the
chip stays in standby on power-up and ignores its 24-bit RGB
input until the I²C wake-up FSM lands in Ch165+.
**SDC change** ([synth/.../de25_nano_psmct32_raster_demo_top.sdc](../../synth/de25_nano/top_psmct32_raster_demo/de25_nano_psmct32_raster_demo_top.sdc)):
`set_false_path -to` for each HDMI output port. Proper
`set_output_delay` constraints with respect to a generated
`HDMI_TX_CLK` domain land alongside the Ch165+ wake-up FSM,
when the ADV7513's actual setup/hold window comes out of the
chip's datasheet pass.
**Scaffold-check extension** ([sim/Makefile](../../sim/Makefile)):
`top_psmct32_raster_demo_quartus_scaffold_check` now also
verifies `HDMI_TX_CLK + HDMI_TX_D[0..23] + HS/VS/DE` are
pin-assigned (sentinel set; not exhaustive) — fails the gate
if Quartus would auto-place them on arbitrary package pins.
**Quartus result vs Ch163 (50 MHz)**:
| Metric | Ch163 (50 MHz, no HDMI pins) | Ch164 (50 MHz + HDMI pins) |
|-----------------------------|-------------------------------|-------------------------------|
| Fit ALMs | 27,543 / 46,800 (59 %) | 27,271 / 46,800 (58 %) |
| Fit RAM / PLL blocks | 14 / 1 | 14 / 1 |
| **Fit pins** | **17 / 351 (5 %)** | **45 / 351 (13 %)** (+28 HDMI) |
| Setup slack worst (design) | +7.500 ns | +7.536 ns |
| Fmax (design domain) | 80.0 MHz | ~80 MHz (unchanged) |
| `quartus_asm` | Successful | Successful (`.sof` produced) |
The +28 pins are exactly the new HDMI shim — 24 RGB lanes, 1
clock, 3 sync (HS / VS / DE). Setup slack stays at ~+7.5 ns
because the new pins are `false_path`'d — STA doesn't time
anything against them yet. ALMs ticked down slightly as the
fitter rebalanced under the wider pin map.
**Snapshot**: Ch163 50 MHz baseline preserved at
[`baseline_ch163_50mhz/`](../../synth/de25_nano/top_psmct32_raster_demo/baseline_ch163_50mhz/)
(syn / fit / sta summaries + parse_report + .sof). The
[`baseline_ch163_30mhz/`](../../synth/de25_nano/top_psmct32_raster_demo/baseline_ch163_30mhz/)
30-MHz milestone is also preserved.
**Open Ch165+ items**:
- **ADV7513 I²C wake-up FSM** — without this the HDMI port
outputs nothing on a real monitor. Ch165 owns the chip
bring-up: pin `HDMI_I2C_SCL` / `HDMI_I2C_SDA` /
`HDMI_TX_INT` / `HDMI_MCLK`, drop in an I²C master that
walks the canonical ADV7513 register-set (sourced from
`retroDE_nes`'s working bring-up).
- Proper `set_output_delay` constraints once the ADV7513
setup/hold window is documented (replacing Ch164's
`false_path`).
- Make the rendered pattern bigger than Ch123's 16×8 SPRITE so
there's something visible to admire on a real screen.
- xfer-side T4 coverage TB (still open from Ch157+).
- `useg_shadow_mem` BRAM-shape forensics.
**Sim regression**: 143 PASS / 0 FAIL unchanged — no RTL
changes that touched sim semantics; the new HDMI ports are
combinational mirrors of existing VIDEO_* signals, and
`tb_de25_nano_psmct32_raster_demo_top` references VIDEO_*
unchanged.
### Wake the ADV7513 — first .sof that drives a real HDMI monitor (Ch165)
Ch165 turns "FPGA pins toggling" into "monitor has a fighting
chance of showing the tiny frame" — Codex's framing for the
chapter. The ADV7513 chip stays in standby on power-up; an I²C
master needs to walk a canonical register-write sequence to
configure 24-bit RGB input + sync polarity + power-up + HPD
override before the chip will accept the FPGA's HDMI_TX_*
data and drive the connector.
**Modules ported** (Terasic DE-series reference design, free
use on Terasic hardware per the license that ships with the
DE25-Nano System CD; copyright retained):
- [`rtl/platform/I2C_Controller.v`](../../rtl/platform/I2C_Controller.v)
— bit-bang I²C master with 33-step transaction layout (start /
slave-addr / sub-addr / data / stop, ~50 µs per bit at the
derived 20 kHz I²C clock).
- [`rtl/platform/I2C_HDMI_Config.v`](../../rtl/platform/I2C_HDMI_Config.v)
— wake-up FSM that walks a 38-entry LUT of ADV7513 register
writes (slave 0x72): power-up + HPD override + audio init +
AVI InfoFrame for full-range RGB 444 + dither + clock-divide +
HDMI mode select. Adapted from the
`retroDE_splash/rtl/platform/` versions (same DE25-Nano
board); LUT customizations (HPD override, AVI InfoFrame for
full-range RGB) carry through.
**Wrapper changes** ([rtl/top/de25_nano_psmct32_raster_demo_top.sv](../../rtl/top/de25_nano_psmct32_raster_demo_top.sv)):
- Four new top-level ports: `inout HDMI_I2C_SCL`,
`inout HDMI_I2C_SDA` (open-drain I²C bus), `input HDMI_TX_INT`
(chip's HPD / monitor-sense interrupt, active-low), and
`output HDMI_MCLK` (audio sample-rate reference, driven by
CLOCK2_50 since the demo is video-only).
- `I2C_HDMI_Config u_hdmi_i2c` instantiated. Clocked on
`CLOCK2_50` (NOT `design_clk` — the wake-up runs even before
the PLL locks); reset on `~ninit_done` (raw async reset; the
I²C bus stays held in a clean state until FPGA init
completes). Output `READY` (= `hdmi_init_done`) goes high
after the LUT walk; `HDMI_TX_INT` going low retriggers the
walk so a late hot-plug after FPGA boot still wakes the chip.
- New status LED: `LED[3] = ~hdmi_init_done` (active-low; lit
means the chip is configured). `LED[7:4]` remain tied HIGH (unlit).
**QSF + files.f + sim Makefile**:
[QSF](../../synth/de25_nano/top_psmct32_raster_demo/de25_nano_psmct32_raster_demo_top.qsf)
gains pin assignments for the 4 new control pins (sourced from
`retroDE_nes`: `BT1` / `BW2` / `CF2` / `CF1`) plus IO standards
(3.3-V LVCMOS for everything). The two new platform Verilog
sources are added to the QSF source list, the synth
[files.f](../../synth/de25_nano/top_psmct32_raster_demo/files.f),
and the sim Makefile's `RTL_SRCS`. The
[scaffold-check](../../sim/Makefile)
extends to verify all 4 control pins are pin-assigned + IO
standard'd, alongside the Ch164 24-pin HDMI data set.
**SDC change**
([de25_nano_psmct32_raster_demo_top.sdc](../../synth/de25_nano/top_psmct32_raster_demo/de25_nano_psmct32_raster_demo_top.sdc)):
`set_false_path -to / -from` on the new control pins. The I²C
bus runs at ~20 kHz (50 µs per SCL period) and is inherently
async to the design clock; HDMI_MCLK is driven by CLOCK2_50 and
sampled by the chip's audio PLL — both well below any
constraint on the fabric.
**Quartus result vs Ch164**:
| Metric | Ch164 (HDMI data only) | Ch165 (HDMI data + wake-up) |
|-------------------------|-------------------------------|-------------------------------|
| Fit ALMs | 27,271 / 46,800 (58 %) | 27,374 / 46,800 (58 %) |
| Fit RAM / PLL blocks | 14 / 1 | 14 / 1 (unchanged) |
| **Fit pins** | **45 / 351** | **49 / 351** (+4 control) |
| Setup slack worst | +7.536 ns | +7.198 ns |
| `quartus_asm` | Successful | Successful (`.sof` produced) |
The +103 ALMs are the I²C controller's bit-bang state machine
and the 38-entry LUT walker. STA stays positive on every
class — the wake-up FSM lives entirely on the I²C-clock domain
(slow), and Recovery analysis on `iRST_N` async-deassert is
cleanly +17.621 ns of slack.
**TB note** — `tb_de25_nano_psmct32_raster_demo_top` (the
Ch149 board smoke) wires up the new HDMI_TX_INT input
(tied high = no interrupt) and leaves the I²C SCL/SDA lines
floating; the wake-up FSM walks the LUT but full completion
takes ~125 ms simulated at the production divider
(controller-clock period ~100 µs × 33 phases/transaction × 38 transactions),
far longer than the existing 5 ms TB runtime. The board TB
doesn't observe `hdmi_init_done` directly — it pre-dates the
wake-up FSM and only smoke-tests the wrapper. The Ch165 audit
landed `tb_hdmi_i2c_wake_smoke` (`sim/tb/top/`), which
overrides `CLK_Freq / I2C_Freq` to collapse the divider so the
walk runs in microseconds and asserts the LUT walk + READY
rise + HDMI_TX_INT retrigger + open-drain SDA + the Ch166
sticky NACK watchdog. Ch167 added a bus-level byte-sequence
lock: the TB switched its SDA model from pulldown to
pullup + a phase-aware slave-ACK driver (drives strong-LOW
exactly when `u_dut.u0.phase` is `PH_ACK0/1/2`, releases
otherwise so the master's data bits are visible on the
wire). A decoder samples SDA on each SCL rising edge
between START and STOP, assembles the three bytes per
transaction into a 24-bit `{dev_addr, reg, data}` tuple,
and compares against `u_dut.mI2C_DATA[23:0]` snapshotted
on `mI2C_GO` rising edges. Asserts: 38 captured == 38
intent, every byte matches, every dev_addr is `8'h72`.
The Phase 3 open-drain check also flipped semantics from
"SDA never strong-HIGH" to "SDA never `'x`" (the right
violation test for the pullup bus).
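
The Ch167 decode step (sample SDA on SCL rising edges between START and STOP, then drop the ACK slots) can be sketched as a software model. This is not the TB's code: `encode_write` and `decode` are hypothetical names, the waveform is idealized with three samples per bit, and the register/data values are illustrative rather than real LUT entries.

```python
from typing import List, Optional, Tuple

Sample = Tuple[int, int]  # (scl, sda) logic levels at one sample instant


def encode_write(dev_addr: int, reg: int, data: int) -> List[Sample]:
    """Idealized waveform for one 3-byte write, with the slave ACKing each byte."""
    wave: List[Sample] = [(1, 1), (1, 0), (0, 0)]  # idle, START, SCL low
    for byte in (dev_addr, reg, data):
        bits = [(byte >> i) & 1 for i in range(7, -1, -1)] + [0]  # MSB first, ACK=0
        for bit in bits:
            wave += [(0, bit), (1, bit), (0, bit)]  # set up, SCL high, SCL low
    wave += [(0, 0), (1, 0), (1, 1)]  # STOP: SDA rises while SCL is high
    return wave


def decode(wave: List[Sample]) -> List[int]:
    """Recover 24-bit {dev_addr, reg, data} words from (scl, sda) samples."""
    words: List[int] = []
    bits: List[int] = []
    pending: Optional[int] = None  # bit seen at the last SCL rising edge
    active = False
    prev_scl, prev_sda = 1, 1
    for scl, sda in wave:
        if prev_scl == 1 and scl == 1 and prev_sda != sda:
            if sda == 0:  # START: SDA falls while SCL is high
                active, bits = True, []
            else:  # STOP: SDA rises while SCL is high
                if active and len(bits) == 27:
                    word = 0
                    for k in range(3):  # keep 8 data bits, skip each ACK slot
                        for bit in bits[9 * k: 9 * k + 8]:
                            word = (word << 1) | bit
                    words.append(word)
                active = False
            pending = None  # a START/STOP edge is not a data bit
        elif prev_scl == 0 and scl == 1:
            pending = sda  # sample SDA on the SCL rising edge
        elif prev_scl == 1 and scl == 0:
            if active and pending is not None:
                bits.append(pending)  # commit once SCL falls again
            pending = None
        prev_scl, prev_sda = scl, sda
    return words


# Illustrative values only; not claimed to be entries of the real LUT.
intent = [0x72_41_10, 0x72_98_03]
wave: List[Sample] = []
for w in intent:
    wave += encode_write((w >> 16) & 0xFF, (w >> 8) & 0xFF, w & 0xFF)
captured = decode(wave)
assert captured == intent
assert all((w >> 16) == 0x72 for w in captured)
```

Committing a bit only when SCL falls again keeps the SCL rise that precedes STOP from being counted as a 28th data bit.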
**Snapshots**: Ch164 baseline preserved at
[`baseline_ch164/`](../../synth/de25_nano/top_psmct32_raster_demo/baseline_ch164/);
Ch165 baseline at
[`baseline_ch165/`](../../synth/de25_nano/top_psmct32_raster_demo/baseline_ch165/).
**Open Ch168+ items**:
- Proper `set_output_delay` constraints on HDMI_TX_* once the
ADV7513 setup/hold window is locked from the bring-up
datasheet pass (replaces the Ch164 `set_false_path -to`).
- Make the rendered pattern bigger than Ch123's 16×8 SPRITE
so there's something visible to admire on a real screen.
- xfer-side T4 coverage TB (still open from Ch157+).
- `useg_shadow_mem` BRAM-shape forensics.
**Sim regression**: 144 PASS / 0 FAIL.
`tb_de25_nano_psmct32_raster_demo_top` PASSES with the new
HDMI control ports wired up (HDMI_TX_INT held high in the
TB; LED=`0b11111000` shows the existing 3 status LEDs lit
— LED[3] stays unlit because the LUT walk doesn't complete
in 5 ms of sim). `tb_hdmi_i2c_wake_smoke` PASSES the
accelerated bring-up + Ch166 NACK-watchdog assertions.
### Hardware-readiness pass for the Ch123 PSMCT32 raster demo (Ch144)
Ch144 is a synthesis/FPGA-readiness audit around the first
hardware-demo candidate (Ch123 PSMCT32 raster e2e, marked above).
No RTL changes — Ch144 documents what a top-level FPGA wrapper
needs to know before attempting a first build.
**RTL dependency tree (Ch123-only)** — what the demo *actually*
instantiates. The full `RTL_SRCS` list compiled by sim contains
~40 modules; Ch123 only reaches these 11, plus the swizzle math
primitive that the three swizzle-aware modules each instantiate
internally:
| Module | Role in Ch123 |
|------------------------------|-------------------------------------------------------------|
| `bios_rom_stub` | EE bootlet at 0xBFC0_0000 (~18 instructions) |
| `ee_ram_stub` | DMAC-side GIF payload (~24 qwords) |
| `ee_memory_map_stub` | EE-CPU + DMAC + bios + map's GS-priv decode |
| `ee_core_stub` | MIPS R5900 core running the bootlet |
| `ee_gs_priv_bridge_stub` | EE 32-bit MMIO → 64-bit GS-priv reg writes |
| `dmac_reg_stub` | DMAC ch2 NORMAL transfer |
| `gif_packed_stub` | GIFtag + PACKED A+D parser |
| `gs_stub` | GS register file + raster (`PSMCT32_SWIZZLE=1`) |
| `gif_image_xfer_stub` | TRXDIR/IMAGE engine (`PSMCT32_SWIZZLE=1`, dormant in Ch123) |
| `vram_stub` | 8 KiB VRAM (one PSMCT32 page) |
| `gs_pcrtc_stub` | PCRTC scanout (`PSMCT32_SWIZZLE=1`) |
| `gs_swizzle_psmct32_stub`    | Pure-comb math, instantiated once in each of the three swizzle-aware modules above |
**Sim-only constructs audit** (full sweep of the 12 modules
above):
- `bios_rom_stub.sv` and `ee_ram_stub.sv` — `$display` /
`$readmemh` inside `initial begin`. Both are synth-safe:
Xilinx Vivado and Intel Quartus support `$readmemh` for BRAM
initialization, and `$display` is silently ignored by all
major synthesizers.
- `vram_stub.sv` L114-117 — single `$error` parameter validator
inside `initial begin`. Synth ignores it; the BYTES parameter
must be set to a sane value at instantiation regardless.
- `ee_gs_priv_bridge_stub.sv` L118 — runtime `$error` on
unsupported byte enables, inside `always_ff`. Synth ignores
the `$error`; the surrounding logic still synthesizes
correctly.
- **No** `$finish` / `$dumpfile` / `$random` / `force` /
`release` / `real`-typed signals / hierarchical refs in any
module of the **Ch123 dep tree**. (TBs use hierarchical refs
into `bios_rom_stub` to preload the bootlet — that's a TB-
only concern; on hardware the bootlet image is the BRAM init.
Out-of-tree note: `boot_install_agent_stub.sv` (SIF subsystem,
not in the Ch123 dep tree) contains a `$fatal` runtime
validator, but it is never compiled into the Ch123 hardware
build.)
**Memory sizing**:
| Memory | Default | Ch123 sim setting | Ch123 hw recommendation | FPGA fit |
|--------------------|---------------|-------------------|-------------------------|----------------------------------|
| `bios_rom_stub` | 4 MiB | 4 KiB | 4 KiB | ≤1 BRAM tile |
| `ee_ram_stub` | 16 KiB | 4 KiB | 4 KiB | ≤1 BRAM tile |
| `vram_stub` | 64 KiB | 8 KiB | 8 KiB | ≤2 BRAM tiles (one PSMCT32 page) |
| `ee_memory_map_stub.useg_shadow_mem` (Ch145) | 4 MiB | 4 MiB | **4 KiB** (override `USEG_SHADOW_WORDS_PARAM=1024`) | ≤1 BRAM tile when overridden |
The 16×8 framebuffer needs only 16×8×4 = 512 bytes; 8 KiB gives
the full first PSMCT32 page (FBP=0). For a more ambitious
hardware demo (multi-page framebuffers, textures), grow
`vram_stub.BYTES` toward 1 MiB / 4 MiB. Real PS2 has 4 MiB of
VRAM; a first hardware build can stay at 8 KiB.
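
The sizing arithmetic above can be reproduced with a short script. It is a sketch under one assumption not stated in this section: that a PSMCT32 page is 64×32 pixels, which is what makes a page 8 KiB. The helper names are hypothetical, and the 640×448 case is only an example of a larger framebuffer, not a planned target.

```python
import math

BYTES_PER_PIXEL = 4          # PSMCT32: 32 bits per pixel
PAGE_W, PAGE_H = 64, 32      # assumed PSMCT32 page dimensions in pixels
PAGE_BYTES = PAGE_W * PAGE_H * BYTES_PER_PIXEL


def framebuffer_bytes(width: int, height: int) -> int:
    """Bytes of pixel data for a width x height PSMCT32 framebuffer."""
    return width * height * BYTES_PER_PIXEL


def vram_bytes_by_page(width: int, height: int) -> int:
    """Bytes reserved when the framebuffer is rounded up to whole pages."""
    pages = math.ceil(width / PAGE_W) * math.ceil(height / PAGE_H)
    return pages * PAGE_BYTES


print(framebuffer_bytes(16, 8))       # pixel data for the Ch123 sprite area
print(PAGE_BYTES)                     # one PSMCT32 page
print(vram_bytes_by_page(16, 8))      # the demo still occupies a whole page
print(vram_bytes_by_page(640, 448))   # example of a larger framebuffer
```

The first three values are 512, 8192 and 8192 bytes; the 640×448 example comes to 1,146,880 bytes (about 1.09 MiB), which is why growing past the demo pushes `vram_stub.BYTES` toward the 1 MiB range.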
**Ch145 — `useg_shadow_mem` parameterization**: pre-Ch145, the
ee_memory_map_stub's useg-shadow backing was a fixed 1M-word /
4 MiB array. That was correct for the BIOS-smoke chapters that
need full first-4-MiB-of-useg coverage, but it's wasted area
for the Ch123 hardware demo (which never touches useg — the
bootlet runs from BIOS at 0xBFC0_0000 and the GIF payload from
RAM at phys 0x100). Ch145 promotes `USEG_SHADOW_WORDS` from a
hardcoded `localparam` to the `USEG_SHADOW_WORDS_PARAM` module
parameter (default 1M words = 4 MiB → existing TBs unchanged).
For the Ch123 hardware demo, the top-level wrapper instantiates
`ee_memory_map_stub` with `.USEG_SHADOW_WORDS_PARAM(1024)` to
shrink the inferred BRAM footprint by ~1024×; correctness is
unaffected because no useg access ever happens in the Ch123
data plane.
**Clock / reset assumptions**:
- Single clock domain (`clk`) — all 12 modules share one input.
- Active-low synchronous reset input (`rst_n`) — also a single
shared input. No reset gating, no per-module variants. The
reset is sampled inside `always_ff @(posedge clk)` via the
`if (!rst_n)` pattern (NOT `posedge clk or negedge rst_n`) —
i.e., it is NOT an async reset despite being active-low. On
FPGA this should be brought up via the device's reset bridge
so the deasserting edge is synchronous to `clk`.
- No clock gating, no derived clocks. The PCRTC's hsync/vsync/de
are regular clock-domain outputs, not separate clocks.
**Swizzle gate parameter defaults**:
- All four swizzle parameters (`PSMCT32_SWIZZLE`,
`PSMCT16_SWIZZLE`, `PSMT8_SWIZZLE`, `PSMT4_SWIZZLE`) default
to `1'b0` on `gs_stub`, `gs_pcrtc_stub`, and
`gif_image_xfer_stub`. For the Ch123 hardware demo,
instantiate these three modules with **`PSMCT32_SWIZZLE(1'b1)`**
and the other three left at `1'b0`. The swizzle-math
primitives (`gs_swizzle_psmct32_stub` etc.) are pure-comb and
trim cleanly when their gate is off.
**Top-level harness expectations** (for a future
`top_psmct32_raster_demo.sv`):
- Inputs: `clk`, `rst_n`, plus board-level video-out connections
(HDMI / DVI / VGA — driven by `r/g/b/hsync/vsync/de` from
`gs_pcrtc_stub`).
- The EE bootlet image must be preloaded into `bios_rom_stub`
via either `IMAGE_FILE` (→ `$readmemh`) or a bake-step that
writes a `.mem` next to the synthesis project. The bootlet is
18 MIPS instructions (currently authored procedurally in the
Ch123 TB body via `ee_prog_word()`); for hardware this needs
to become a static `.mem` checked into the repo.
- The GIF payload must be preloaded into `ee_ram_stub` via the
same mechanism — 24 qwords starting at `PAYLOAD_MADR=0x100`.
Current TB authors them procedurally with `preload_qword()`;
hardware needs a static `.mem`.
- The `core_go` signal must be tied high (or pulsed by a board
reset-release sequencer) so the EE starts fetching from
`0xBFC0_0000`.
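
One possible shape for the bake step is a small script that renders a word list into a `$readmemh`-compatible `.mem` file. This is a sketch, not the project's tooling: `render_mem` and `bake` are hypothetical names, the words below are NOP placeholders rather than the real bootlet, and it assumes one memory word per line with the word width matching the target array (32 bits for the bootlet, 128 bits for payload qwords).

```python
from typing import Iterable


def render_mem(words: Iterable[int], width_bits: int = 32) -> str:
    """Render words as fixed-width hex, one per line, for $readmemh."""
    if width_bits <= 0 or width_bits % 4 != 0:
        raise ValueError("width_bits must be a positive multiple of 4")
    digits = width_bits // 4
    lines = []
    for index, word in enumerate(words):
        if not 0 <= word < (1 << width_bits):
            raise ValueError(f"word {index} does not fit in {width_bits} bits")
        lines.append(f"{word:0{digits}x}")
    return "\n".join(lines) + "\n"


def bake(path: str, words: Iterable[int], width_bits: int = 32) -> None:
    """Write the rendered image to disk next to the synthesis project."""
    with open(path, "w", encoding="ascii") as fh:
        fh.write(render_mem(words, width_bits))


# Placeholder image: 18 MIPS NOPs (0x00000000), standing in for the bootlet.
placeholder_bootlet = [0x0000_0000] * 18
text = render_mem(placeholder_bootlet)
assert text.count("\n") == 18
```

Checking the generated `.mem` into the repo, rather than regenerating it at build time, keeps the hardware image reviewable in diffs alongside the TB that originally authored it.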
**Known sim-only constructs that should NOT block first build**:
- `$display` lines in BIOS/RAM init (synth ignores).
- `$readmemh` (synth tools handle it for BRAM init).
- `$error` parameter validators (synth ignores).
**Known sim-only constructs that WOULD block first build**:
- None found in the Ch123 dep tree.
**Open questions for the hardware-build session** (deliberately
not answered here — they need a board-level decision):
- Target FPGA family + clock frequency (PCRTC was designed
around 13.5 MHz pixel clock for the 16×8 active area; first
build can run at any clock since the TB doesn't model real
CRTC timing).
- Video-out PHY (HDMI core, VGA DAC, on-board HDMI
transmitter chip).
- BIOS / payload bake step (Vivado `update_compile_order` +
`.mem` files vs. a SystemVerilog `localparam` array
preload).
- Whether to keep `ee_core_stub`'s `STRICT_UNSUPPORTED` gate
active on hardware (catches unknown opcodes by halt+latch —
useful for debugging, but a hard failure on any unintended
fetch).
The Ch90 white-box TB `tb_gs_scanout_basic.sv` exercises the
full round trip: instantiates `gs_stub` + `vram_stub` +
`gs_pcrtc_stub`, drives a 4×4 sprite through the GIF reg port,
waits for raster to fully drain, then enables scanout and
captures one full frame's `(hcnt, vcnt) → (r, g, b)` trace.
Asserts: every pixel inside the sprite reads as the emitted
color, every pixel outside reads as 0, and at least one EV_MODE
frame trace fires.