# GIF/GS Contract

Status: `Draft`

## Purpose

Define the graphics ingress and rendering/display boundary.

## Owns

- GIF path intake and arbitration,
- GIF tag interpretation,
- GS register decode,
- GS VRAM-visible operations,
- framebuffer/zbuffer/texture-visible state handling,
- PCRTC/display output generation or a planned approximation layer.

## Inputs

- DMAC channel 2 traffic,
- VIF/VU-generated graphics traffic,
- privileged GS register writes,
- reset and display configuration controls.

## Outputs

- VRAM updates,
- display timing and pixel output,
- status/interrupt signals,
- packet and register trace events.

## Questions to lock

- What is the first output milestone:
  - GS privileged register acceptance only
  - static background color
  - minimal primitive draw
  - `gsKit`-style demo target
- Is Phase 1 display based on a faithful GS/PCRTC path or a temporary adapter?
- What VRAM organization assumptions must stay stable from the beginning?

## Allowed early stubs

- privileged-register-only GS stub,
- BGCOLOR/test-pattern display path,
- packet logger with no rendering.

## Required debug visibility

- GIF tags,
- PATH source and arbitration result,
- GS register writes,
- VRAM write summaries,
- display mode transitions.

## First meaningful milestone

- a known packet stream or direct privileged-register sequence produces a stable,
  visible, repeatable output and matching trace.

## GS write-port contract (Ch75)

The GS model has **two architecturally distinct write ports** because real PS2
hardware exposes two unrelated register namespaces. Conflating them was a Ch74
mistake; Ch75 split them.

### `reg_wr_*` — privileged GS/MMIO writes

- Source: CPU MMIO writes to the `0x12000000` privileged-register block, e.g.
  via `platform_video_stub` or a direct test-harness path.
- Address: `reg_wr_addr[15:0]` is the offset *inside* the privileged block.
- Examples: `BGCOLOR` at offset `0x00E0`, `PMODE` at `0x0000`,
  `SMODE2` at `0x0020`, etc.
- Currently latched: `BGCOLOR` only. Other offsets emit `EV_MODE`.

### `gif_reg_*` — GIF A+D register-number writes

- Source: `gif_packed_stub` consuming a PACKED A+D entry when run with
  `REAL_AD_REG_MAP=1` (the path used for real PS2 packets; the parameter
  itself still defaults to `0` for back-compat with the project-local
  Ch72/Ch73 PACKED A+D layout).
- Address: `gif_reg_num[7:0]` is the **GIF A+D register number** straight
  out of the PACKED entry's `in_data[71:64]`. Source-of-truth is PCSX2
  `pcsx2/GS/GSRegs.h`.
- Currently decoded: `PRIM=0x00`, `RGBAQ=0x01`, `XYZF2=0x04`, `XYZ2=0x05`,
  `FRAME_1=0x4C`, `ZBUF_1=0x4E` (**not `0x4F` — that is `ZBUF_2`**).
  Each has a dedicated 64-bit latch output. Other reg numbers emit `EV_MODE`.

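
As a software-side illustration of the `gif_reg_*` decode described above, the following is a minimal Python sketch of how a trace checker might split a PACKED A+D qword under `REAL_AD_REG_MAP=1`. The helper names (`split_packed_ad`, `classify`) are hypothetical, and the payload position `in_data[63:0]` is an assumption based on the usual A+D layout; only the register-number field `in_data[71:64]` and the register numbers come from this section.

```python
# Register numbers listed as "currently decoded" in this section.
GIF_REG_NAMES = {
    0x00: "PRIM",
    0x01: "RGBAQ",
    0x04: "XYZF2",
    0x05: "XYZ2",
    0x4C: "FRAME_1",
    0x4E: "ZBUF_1",
}


def split_packed_ad(in_data):
    """Split a 128-bit PACKED A+D qword into (reg_num, data)."""
    reg_num = (in_data >> 64) & 0xFF           # in_data[71:64]
    data = in_data & 0xFFFF_FFFF_FFFF_FFFF     # in_data[63:0] (assumed payload)
    return reg_num, data


def classify(reg_num):
    """Name of the tracked register, or EV_MODE for anything untracked."""
    return GIF_REG_NAMES.get(reg_num, "EV_MODE")
```

Note that `0x4F` deliberately falls through to `EV_MODE` here, mirroring the `ZBUF_1` versus `ZBUF_2` warning above.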
### Event taxonomy

The two write paths emit different events. Read this carefully — `arg2`
semantics differ across emitters.

- `EV_BGCOLOR` — emitted **only** by `gs_stub` on the privileged port
  when `reg_wr_addr == 0x00E0`. Carries the unpacked R/G/B in
  `arg0`/`arg1`/`arg2`. The privileged port has no per-register
  "selector" beyond this dedicated event; everything else on that port
  goes to `EV_MODE` with `arg0=offset`, `arg1=data`.

- `EV_WRITE` — emitted in two places with different `arg2` semantics:
  - **`gif_packed_stub`** on a PACKED A+D accept (REGS nibble = `0xE`).
    Carries the raw PACKED address bits in `arg2`
    (`{48'd0, in_data[79:64]}`). Under `REAL_AD_REG_MAP=1` the low 8 bits are the
    real GIF reg# (`in_data[71:64]`); under `REAL_AD_REG_MAP=0` the low
    16 bits are the project-local privileged-style offset. **Not a
    stable selector — it is the address half of the wire.**
  - **`gs_stub`** on the `gif_reg_*` port for a tracked GIF reg
    (PRIM/RGBAQ/XYZF2/XYZ2/FRAME_1/ZBUF_1, plus TEX0_1 from Ch98).
    Carries a **stable per-register selector** in `arg2`:
    `1=PRIM, 2=RGBAQ, 3=XYZF2, 4=XYZ2, 5=FRAME_1, 6=ZBUF_1, 7=TEX0_1`
    (selector 7 added in Ch98). `arg0=reg#`, `arg1=data`. Use this
    selector for trace-side filtering; it does not depend on
    `REAL_AD_REG_MAP`.
  - **Ch76 caveat**: a tracked vertex commit (XYZ2 or XYZF2) on the
    `gif_reg_*` port that *closes* a primitive does NOT emit `EV_WRITE`
    that cycle — `EV_PRIM_DRAW` preempts it (see below). The `xyz2_q` /
    `xyzf2_q` latch still updates. Trace consumers counting "vertices
    seen" must sum `EV_WRITE`(selector=3 or 4) + `EV_PRIM_DRAW` to get
    the true total.

- `EV_PRIM_DRAW` — Ch76 / Ch77. Fired by `gs_stub` once per primitive
  completion: when an XYZ2 or XYZF2 vertex commit on the `gif_reg_*`
  port closes a primitive under the current `PRIM[2:0]`. Preempts the
  `EV_WRITE` that the closing vertex would otherwise have emitted.
  Args: `arg0=PRIM[2:0]` (prim type), `arg1=primary threshold`,
  `arg2` = cumulative `prim_complete_count` (post-increment),
  `arg3=closing vertex data` (the same 64 bits that latched into
  `xyz2_q` / `xyzf2_q` on this cycle).
  - **Discrete primitives** (vertex threshold N: POINT=1, LINE=2,
    TRIANGLE=3, SPRITE=2): one draw per N vertices; the vertex counter
    resets to 0 after each draw.
  - **Strip / fan primitives** (anchor threshold N: LINE_STRIP=2,
    TRI_STRIP=3, TRI_FAN=3): Ch77. Anchor on the first N vertices, then
    fire one draw per additional vertex commit. The vertex counter
    saturates at the primary threshold so every subsequent vertex closes
    another primitive. Ch78 adds **vertex-identity tracking** distinguishing
    TRI_STRIP rolling triangles `{v_n-2, v_n-1, v_n}` from TRI_FAN
    pivot triangles `{v_pivot, v_n-1, v_n}` — see the next section.
  - **Reserved** (PRIM=7): no draw, vertex commits do not increment
    the counter, latches still update.
  - A PRIM write always resets the vertex counter so a fresh
    primitive type starts cleanly.

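
The counter rules and the Ch76 counting caveat above can be sketched as a small event-level model. This is a hedged Python illustration, not the RTL: the class name `PrimCounter` and its methods are hypothetical, and returning `EV_WRITE` for vertex commits under the reserved PRIM is an assumption (the section only states that no draw fires and the counter does not move).

```python
# Vertices needed to close a primitive, keyed by PRIM[2:0] (7 is reserved).
THRESHOLD = {0: 1, 1: 2, 2: 2, 3: 3, 4: 3, 5: 3, 6: 2}
STRIP_OR_FAN = {2, 4, 5}  # LINE_STRIP, TRI_STRIP, TRI_FAN


class PrimCounter:
    """Hypothetical event-level model of the vertex counter."""

    def __init__(self):
        self.prim = 0
        self.count = 0
        self.prim_complete_count = 0  # cumulative; a PRIM write does not clear it

    def write_prim(self, prim):
        self.prim = prim & 0x7
        self.count = 0  # a PRIM write always resets the vertex counter

    def commit_vertex(self):
        """Return the event an XYZ2/XYZF2 commit would emit."""
        n = THRESHOLD.get(self.prim)
        if n is None:
            return "EV_WRITE"  # reserved PRIM: counter frozen (event is assumed)
        self.count = min(self.count + 1, n)  # saturate at the threshold
        if self.count < n:
            return "EV_WRITE"
        self.prim_complete_count += 1
        if self.prim not in STRIP_OR_FAN:
            self.count = 0  # discrete primitives restart after each draw
        return "EV_PRIM_DRAW"
```

For a five-vertex TRI_STRIP this model yields two `EV_WRITE` events followed by three `EV_PRIM_DRAW` events, so the vertex total is the sum of both event kinds, as the caveat requires.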
### Per-primitive vertex snapshot (Ch78)

Alongside `EV_PRIM_DRAW`, `gs_stub` exposes three 64-bit outputs —
`prim_v0_q`, `prim_v1_q`, `prim_v2_q` — that hold the *vertex tuple*
of the most recently closed primitive. Snapshot is registered on the
same clock edge as the `ev_valid` pulse and held until the next
`prim_complete`, so a TB can sample it at the same time it sees
`gs_ev_event == EV_PRIM_DRAW`.

The number of valid slots is implicit in `PRIM[2:0]`:

| `PRIM` | type | valid slots | semantics |
|---|---|---|---|
| 0 | POINT | `v0` | the single vertex |
| 1 | LINE | `v0`, `v1` | endpoints |
| 2 | LINE_STRIP | `v0`, `v1` | each segment uses `{v_n-1, v_n}` |
| 3 | TRIANGLE | `v0`, `v1`, `v2` | the three vertices |
| 4 | TRI_STRIP | `v0`, `v1`, `v2` | rolling: `{v_n-2, v_n-1, v_n}` |
| 5 | TRI_FAN | `v0`, `v1`, `v2` | pivot+rolling: `{v_pivot, v_n-1, v_n}` |
| 6 | SPRITE | `v0`, `v1` | top-left + bottom-right |
| 7 | reserved | — | observer never closes |

The TRI_STRIP-vs-TRI_FAN distinction lives entirely in the
saturated-extension path: a TRI_STRIP advances `v0` each draw with
the rolling window; a TRI_FAN pins `v0` to `v_pivot` (the first
vertex committed since the most recent PRIM write). On the *anchor*
draw, `v_pivot` and the rolling `v_prev` happen to coincide, so
TRI_STRIP and TRI_FAN report the same tuple for their first
triangle.

A PRIM write clears the rolling window (`v_curr` / `v_prev` /
`v_prev_prev` / `v_pivot` / `pivot_seen`) so a fresh primitive
context starts with no residual vertex bleed. Slots not used by the
current primitive type read `0`.

The snapshot tracks identity, not geometry — the values written are
the raw 64-bit `gif_reg_data` payloads of XYZ2 / XYZF2 commits, with
no decoding into screen-space coordinates. Rasterization is still
out of scope.

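
To make the strip-versus-fan tuple difference concrete, here is a minimal Python sketch of the snapshot sequence for one primitive context (one PRIM write followed by vertex commits). The function name `closed_tuples` is hypothetical, and the model works on opaque payload values only, in keeping with the identity-not-geometry contract above.

```python
THRESHOLD = {0: 1, 1: 2, 2: 2, 3: 3, 4: 3, 5: 3, 6: 2}  # PRIM[2:0] -> vertices
STRIP_OR_FAN = {2, 4, 5}
TRI_FAN = 5


def closed_tuples(prim, payloads):
    """Return the (v0, v1, v2) snapshot for each primitive that closes.

    `payloads` are the raw vertex payloads committed after a PRIM write;
    unused slots read 0, as in the table above.
    """
    n = THRESHOLD.get(prim)
    if n is None:
        return []  # reserved PRIM never closes
    out, window, pivot = [], [], None
    for v in payloads:
        if pivot is None:
            pivot = v  # first vertex since the PRIM write
        window.append(v)
        if len(window) < n:
            continue
        slots = list(window)
        if prim == TRI_FAN:
            slots[0] = pivot  # the fan pins v0 to the pivot
        out.append(tuple(slots + [0] * (3 - n)))
        # Strips and fans keep rolling; discrete primitives start over.
        window = window[1:] if prim in STRIP_OR_FAN else []
    return out
```

With payloads `[10, 11, 12, 13, 14]`, the first tuple is identical for TRI_STRIP and TRI_FAN, and the two diverge from the second triangle on.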
### Per-primitive color snapshot (Ch79 / Ch80)

`prim_color_q[63:0]` is registered on the same edge as
`prim_v0_q` / `prim_v1_q` / `prim_v2_q` and carries the value of
`rgbaq_q` at the moment the primitive closed. RGBAQ writes are
separate A+D entries from XYZ2 / XYZF2 commits (`gif_packed_stub`
serializes A+D to one accept per cycle), so `rgbaq_q` is always
settled to its draw-time value when `prim_complete_now` fires.

`prim_color_q` reads `0` if no RGBAQ has been written since reset;
`rgbaq_q` itself is **not** cleared on a PRIM write — color carries
forward across PRIM context switches, matching real GS behavior —
but it does reset to `0` on `rst_n`.

#### Per-vertex Gouraud color (Ch80)

For real game streams that interleave RGBAQ writes with vertex
commits to drive Gouraud shading, `gs_stub` exposes three
additional outputs:

| Output | Slot semantics |
|---|---|
| `prim_color_v0_q[63:0]` | color of vertex 0 |
| `prim_color_v1_q[63:0]` | color of vertex 1 |
| `prim_color_v2_q[63:0]` | color of vertex 2 |

A parallel rolling color window (`c_curr_q` / `c_prev_q` /
`c_prev_prev_q` / `c_pivot_q`, internal) samples `rgbaq_q` on
every vertex commit, mirroring the Ch78 vertex-identity window.
The snapshot layout matches the vertex layout exactly:

| `PRIM` | type | `_v0_q` color of | `_v1_q` color of | `_v2_q` color of |
|---|---|---|---|---|
| 0 | POINT | the single vertex | 0 | 0 |
| 1 | LINE | first endpoint | closing | 0 |
| 2 | LINE_STRIP | previous vertex | closing | 0 |
| 3 | TRIANGLE | `v_n-2` | `v_n-1` | closing |
| 4 | TRI_STRIP | `v_n-2` (rolls) | `v_n-1` | closing |
| 5 | TRI_FAN, anchor | 1st committed vertex (≡ pivot) | 2nd committed vertex | closing (3rd committed vertex) |
| 5 | TRI_FAN, saturated | `v_pivot` (PINNED) | `v_n-1` | closing |
| 6 | SPRITE | first endpoint | closing | 0 |

`prim_color_q` is exactly the closing-vertex color (≡
`prim_color_v_close`), kept as a convenience alias for consumers
that don't care about Gouraud.

For **flat-shaded** primitives (RGBAQ written once before the
strip), all per-vertex color slots used by the primitive equal
each other and equal `prim_color_q`. For **Gouraud-shaded**
primitives (RGBAQ rewritten between vertex commits), the slots
may differ — capturing the per-vertex color identity needed to
distinguish a strip's rolling colors from a fan's pivot color.

The color window is **cleared on PRIM write** (unlike `rgbaq_q`
itself, which carries forward). This means per-vertex color
identity stays tied to the current primitive context — a stream
that switches PRIM types mid-context starts color tracking fresh
for the new context. Slots not used by the current primitive type
read `0`.

Like the vertex snapshot, this captures identity, not interpolated
geometry — the stored values are the raw 64-bit RGBAQ payloads
(packing R, G, B, A, and the texture-coord divisor Q together);
GS-style Gouraud interpolation across the primitive interior
remains out of scope.

### Structured-field decode (Ch81)

`gs_stub` exposes pre-decoded snapshot outputs alongside the raw
64-bit slots so a downstream rasterizer or pixel-emit path doesn't
have to re-derive bit fields:

| Output | Type | Carries |
|---|---|---|
| `prim_v0_decoded_q` / `_v1_` / `_v2_` | `trace_pkg::vertex_t` | `x` / `y` / `z` / `fog` / `is_xyzf2` per slot |
| `prim_v0_color_decoded_q` / `_v1_` / `_v2_` | `trace_pkg::color_t` | `r` / `g` / `b` / `a` / `q` per slot |

The decoded outputs latch on the same edge as the raw snapshots, so
a TB samples both atomically with `EV_PRIM_DRAW`.

#### vertex_t and the XYZ2 / XYZF2 distinction

```sv
typedef struct packed {
    logic        is_xyzf2; // 1 = XYZF2 source, 0 = XYZ2
    logic [7:0]  fog;      // valid iff is_xyzf2; else 0
    logic [31:0] z;        // 32-bit (XYZ2) or zero-extended 24-bit (XYZF2)
    logic [15:0] y;        // 12.4 fixed-point screen Y
    logic [15:0] x;        // 12.4 fixed-point screen X
} vertex_t;
```

XYZ2 packs full 32-bit Z in `data[63:32]`. XYZF2 packs 24-bit Z in
`data[55:32]` and an 8-bit fog byte in `data[63:56]`. The `is_xyzf2`
flag is registered in a parallel rolling format-flag window
(`xyzf2_curr_q` / `xyzf2_prev_q` / `xyzf2_prev_prev_q` /
`xyzf2_pivot_q`) that tracks the source format of each vertex
through the rolling window — so when an XYZF2 vertex rolls into
the `v_prev` slot of a TRI_STRIP saturated extension, its
`is_xyzf2` flag rolls with it.

The format-flag window is cleared on `rst_n` and on PRIM write, same
as the vertex/color windows.

#### color_t

```sv
typedef struct packed {
    logic [31:0] q; // texture-coord divisor (IEEE float)
    logic [7:0]  a;
    logic [7:0]  b;
    logic [7:0]  g;
    logic [7:0]  r;
} color_t;
```

Direct bit-slice of the RGBAQ payload — no interpretation. Q is
carried verbatim as a 32-bit IEEE float (the GS uses it for
texture coordinate division during rasterization, which remains
out of scope).

#### Decode helper functions

`trace_pkg` exposes `decode_vertex(data, is_xyzf2)` and
`decode_color(data)` so downstream code can re-decode raw 64-bit
values consistently with the `gs_stub` snapshot.

The decoded outputs are an additive contract — the raw `prim_v*_q`
and `prim_color_v*_q` outputs continue to work for consumers that
don't care about per-channel decoding.

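
For host-side tooling that post-processes traces, the same slicing can be mirrored in software. The sketch below is a hypothetical Python counterpart to `decode_vertex` / `decode_color`, not the `trace_pkg` source. The Z and fog positions come from this section; placing X in `data[15:0]`, Y in `data[31:16]`, and R/G/B/A in the low four bytes of RGBAQ are assumptions inferred from the packed struct order.

```python
from typing import NamedTuple


class Vertex(NamedTuple):
    x: int          # 12.4 fixed-point screen X
    y: int          # 12.4 fixed-point screen Y
    z: int
    fog: int
    is_xyzf2: bool


class Color(NamedTuple):
    r: int
    g: int
    b: int
    a: int
    q: int          # raw 32-bit IEEE float bits, not interpreted


def decode_vertex(data, is_xyzf2):
    x = data & 0xFFFF                      # data[15:0]  (assumed position)
    y = (data >> 16) & 0xFFFF              # data[31:16] (assumed position)
    if is_xyzf2:
        z = (data >> 32) & 0xFF_FFFF       # data[55:32], zero-extended
        fog = (data >> 56) & 0xFF          # data[63:56]
    else:
        z = (data >> 32) & 0xFFFF_FFFF     # data[63:32]
        fog = 0
    return Vertex(x, y, z, fog, is_xyzf2)


def decode_color(data):
    return Color(
        r=data & 0xFF,
        g=(data >> 8) & 0xFF,
        b=(data >> 16) & 0xFF,
        a=(data >> 24) & 0xFF,
        q=(data >> 32) & 0xFFFF_FFFF,
    )
```

The integer pixel coordinate used elsewhere in this document is then `vertex.x >> 4` and `vertex.y >> 4`, dropping the 4-bit sub-pixel fraction.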
### Minimal pixel emit (Ch82)

`gs_stub` exposes a per-primitive *pixel emit* — the smallest
possible output that ties the recognition layer to a framebuffer
destination. One pixel per closed primitive (the closing vertex,
in screen-space integer coords), addressed by the latched
`frame_1_q` register. No interpolation, no coverage, no
rasterization — this is the contact point for a future raster
chapter, not a substitute for one.

| Output | Width | Carries |
|---|---|---|
| `pixel_emit` | 1 | 1-cycle strobe; pulses on the same edge as `prim_complete` |
| `pixel_emit_count` | 32 | Running tally of emits since reset |
| `pixel_x_q` / `pixel_y_q` | 12 | Closing vertex integer screen coords (top 12 bits of 12.4 fixed-point) |
| `pixel_color_q` | 64 | RGBAQ at the emit moment (= `prim_color_q`) |
| `pixel_fbp_q` | 9 | `FRAME_1[8:0]` (framebuffer base / 2048) |
| `pixel_fbw_q` | 6 | `FRAME_1[21:16]` (framebuffer width in pixels / 64) |
| `pixel_psm_q` | 6 | `FRAME_1[29:24]` (pixel storage format) |
| `pixel_fb_addr_q` | 32 | Computed VRAM byte offset (see below) |

#### Address arithmetic

```
fb_addr = FBP * 2048 + (Y * FBW * 64 + X) * bytes_per_pixel
```

Ch83 added PSM-aware `bytes_per_pixel` derived from the latched
`FRAME_1[29:24]` (PSM field):

| PSM (hex) | Format | bytes/pixel | Notes |
|---|---|---|---|
| 00, 01 | PSMCT32 / PSMCT24 | 4 | host-word |
| 02, 0A | PSMCT16 / PSMCT16S | 2 | |
| 13 | PSMT8 | 1 | indexed |
| 14 | PSMT4 | 4 here (host-word) | **legacy `pixel_emit` channel only** — see note below |
| 1B, 24, 2C | PSMT8H / PSMT4HL / PSMT4HH | 4 | host-word (high/low nibble of 32-bit slot) |
| 30, 31 | PSMZ32 / PSMZ24 | 4 | depth |
| 32, 3A | PSMZ16 / PSMZ16S | 2 | depth |
| other | — | 4 (host-word fallback) | unrecognized PSM |

This table describes the **legacy `pixel_emit` channel** (the
single-pixel-per-primitive debug strobe from Ch82/Ch83). That
channel does not commit to `vram_stub`; it only emits a trace
event. Its PSMT4 entry stays at host-word fallback — the
recognition layer never tracked sub-byte position there.

The **raster channel (`raster_pixel_emit`)** does NOT use this
table. It owns its own PSM-aware emit packing in S2 with full
PSMT4 support after Ch106:

- Byte address = `pixel_index >> 1` (overrides the
  `pixel_index << ras_bpp_shift` form).
- The 4-bit index from R[3:0] is placed in the targeted nibble
  (low/high keyed by `pixel_index[0]`) of `write_data[7:0]`.
- `raster_pixel_be_q = 4'b0001`, `raster_pixel_mask_q = 0x0F`
  or `0xF0` so `vram_stub`'s per-bit merge updates only that
  nibble.

PSMT8H / PSMT4HL / PSMT4HH still address the host 32-bit slot,
not the high/low byte/nibble within it; the extracted sub-byte
is rasterizer/blit-specific and out of scope here.

`pixel_psm_q` is still exposed verbatim so consumers can apply
their own sub-slot offset arithmetic if needed.

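
The legacy-channel formula and PSM table above can be checked with a few lines of host code. This is a minimal Python sketch under stated assumptions: the function name `legacy_fb_addr` is hypothetical, and truncating the result to 32 bits (to match the width of `pixel_fb_addr_q`) is an assumption rather than documented behavior.

```python
# bytes_per_pixel for the legacy pixel_emit channel, keyed by FRAME_1 PSM.
BYTES_PER_PIXEL = {
    0x00: 4, 0x01: 4,              # PSMCT32 / PSMCT24
    0x02: 2, 0x0A: 2,              # PSMCT16 / PSMCT16S
    0x13: 1,                       # PSMT8
    0x14: 4,                       # PSMT4: host-word fallback on this channel
    0x1B: 4, 0x24: 4, 0x2C: 4,     # PSMT8H / PSMT4HL / PSMT4HH
    0x30: 4, 0x31: 4,              # PSMZ32 / PSMZ24
    0x32: 2, 0x3A: 2,              # PSMZ16 / PSMZ16S
}


def legacy_fb_addr(frame_1, x, y):
    """fb_addr = FBP * 2048 + (Y * FBW * 64 + X) * bytes_per_pixel."""
    fbp = frame_1 & 0x1FF              # FRAME_1[8:0]
    fbw = (frame_1 >> 16) & 0x3F       # FRAME_1[21:16]
    psm = (frame_1 >> 24) & 0x3F       # FRAME_1[29:24]
    bpp = BYTES_PER_PIXEL.get(psm, 4)  # unrecognized PSM: host-word fallback
    addr = fbp * 2048 + (y * fbw * 64 + x) * bpp
    return addr & 0xFFFF_FFFF          # assumed truncation to the output width
```

For example, with `FBP=1`, `FBW=10` (640 pixels) and PSMCT32, pixel `(3, 2)` maps to `2048 + (2*640 + 3) * 4 = 7180`.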
#### Carry-forward semantics

`frame_1_q` is part of the standard GIF-context register file and
carries forward across PRIM writes (matching real GS). A stream
that sets `FRAME_1` once and then emits multiple primitives
correctly addresses all of them. A stream that never writes
`FRAME_1` addresses every pixel with `FBP=0` and `FBW=0`, so per
the formula above `fb_addr` collapses to `X * bytes_per_pixel`:
observable but not useful, and it behaves cleanly under reset.

`rgbaq_q` likewise carries forward, so `pixel_color_q` reflects
the most recent RGBAQ write at emit time. If a Gouraud-style
stream rewrites RGBAQ between vertices, `pixel_color_q` captures
the closing-vertex color — same semantic as Ch79's
`prim_color_q`.

#### Strobe channel, not trace event

`pixel_emit` is a dedicated 1-cycle strobe alongside the snapshot
outputs, not a multiplexed event on the main `ev_valid` trace
stream. This avoids contention with `EV_PRIM_DRAW` on the close
cycle. A consumer that wants both can sample on `pixel_emit`
posedge and read the snapshots atomically.

### Minimal interior rasterizer (Ch84)

`gs_stub` adds a *separate* per-interior-pixel emit channel
alongside the per-primitive `pixel_emit` of Ch82. The Ch82
strobe is unchanged (still pulses once per closed primitive); the
new channel pulses once per pixel that the rasterizer determines
is inside the closed primitive's interior.

| Output | Width | Carries |
|---|---|---|
| `raster_pixel_emit` | 1 | 1-cycle strobe per emitted interior pixel |
| `raster_pixel_emit_count` | 32 | Cumulative interior pixels emitted since reset |
| `raster_pixel_x_q` / `_y_q` | 12 | Integer screen coords of the emitted pixel |
| `raster_pixel_color_q` | 64 | Per-pixel color: Gouraud-interpolated R/G/B/A for TRI/TRI_STRIP/TRI_FAN (Ch86), flat (= `prim_color_q`) for SPRITE. Q passes through from the closing vertex. |
| `raster_pixel_fb_addr_q` | 32 | Computed VRAM byte offset (PSM-aware, same math as Ch82/Ch83) |
| `raster_active` | 1 | High while the FSM is scanning a primitive |
| `raster_overflow` | 1 | Latches if a new primitive closes while the 2-entry raster FIFO is full and no concurrent pop frees a slot (Ch87 + audit-medium fix). See "Raster command queue (Ch87)" below for the back-to-back-close budget. |
| `raster_degenerate` | 1 | Latches if a TRI/STRIP/FAN closes with zero signed area (3 colinear vertices). SCAN is skipped; SPRITE never sets this. |

#### Per-primitive coverage

| `PRIM` | Raster behavior |
|---|---|
| 0 POINT | No raster emit — Ch82 closing-pixel only |
| 1 LINE | No raster emit — Ch82 closing-pixel only |
| 2 LINE_STRIP | No raster emit — Ch82 closing-pixel only |
| 3 TRIANGLE | Bounding-box scan with edge-function half-plane test |
| 4 TRI_STRIP | Same engine as TRIANGLE, fires per closed strip triangle |
| 5 TRI_FAN | Same engine as TRIANGLE, fires per closed fan triangle |
| 6 SPRITE | Bounding-box rectangle fill (every pixel inside) |
| 7 reserved | No raster emit |

#### Triangle edge-function math

For each candidate pixel `p` and each edge `(vA, vB)` of the
triangle:

```
e(p) = (p.x - vA.x) * (vB.y - vA.y) - (p.y - vA.y) * (vB.x - vA.x)
```

32-bit signed math is used to avoid overflow at typical coord
ranges.

##### Top-left fill rule (Ch85)

Adjacent triangles that share an edge would double-paint pixels
on that edge under a naïve same-sign test. Ch85 applies the
standard D3D-style top-left fill rule so each shared-edge pixel
is owned by exactly one of the two triangles.

At the IDLE→SCAN transition the FSM:

1. Computes `signed_area = (v1-v0) × (v2-v0)`.
2. If `signed_area == 0` → degenerate (3 colinear vertices);
   `raster_degenerate` latches and SCAN is skipped (no
   raster pixels emit). The Ch82 `pixel_emit` and `prim_complete`
   pulses still fire — only the interior raster is suppressed.
3. If `signed_area < 0` → CW winding; the FSM swaps `v1` and
   `v2` so the rule applies uniformly to a CCW-ordered triangle.
4. For each edge of the post-swap CCW triangle, classifies it as
   *top-or-left* (inclusive) or *right/bottom* (exclusive):
   - **Top edge**: horizontal going right (`dy == 0 && dx > 0`).
   - **Left edge**: going down in Y-down screen (`dy > 0`).
   - Anything else is a right or bottom edge.

The inside test in SCAN becomes:

```
inside = (e[i] + bias[i] <= 0) for all i in {0, 1, 2}
```

where `bias[i] = 0` if edge `i` is top-or-left and `bias[i] = 1`
otherwise. The `+1` bias converts the strict `< 0` test for
right/bottom edges into a non-strict `<= 0` test on the biased
value, keeping the math integer and uniform.

Result: for any two adjacent triangles sharing an edge, the
edge's pixels are inclusive in exactly one triangle's bias
configuration and exclusive in the other's — no double-paint.

Some shared-corner pixels may end up unpainted by either
triangle. That's the standard top-left rule trade-off:
non-overlap takes priority over coverage of every boundary
pixel.

##### Per-pixel Gouraud color (Ch86)

Triangle interior pixels now use **per-pixel Gouraud color
interpolation** instead of flat shading. The three per-vertex
colors (the same Ch80 `prim_color_v0_q` / `prim_color_v1_q` /
`prim_color_v2_q` slot mapping) are latched at SCAN init with
the same `v1↔v2` swap mirror as the vertex coords, so the
post-swap CCW vertex order matches the latched color order.

For each interior pixel `p`, barycentric weights are derived
directly from the unbiased edge functions, where `e0`, `e1`, `e2`
are the edge functions of `(v0, v1)`, `(v1, v2)`, `(v2, v0)` and
`sa` is the signed area:

```
L0(p) = -e1(p)   // weight for v0 = signed area of (p, v1, v2)
L1(p) = -e2(p)   // weight for v1
L2(p) = -e0(p)   // weight for v2
// L0 + L1 + L2 == sa for all p inside the triangle
```

For each color channel `ch` ∈ {R, G, B, A}:

```
ch_out(p) = (L0(p)*c0.ch + L1(p)*c1.ch + L2(p)*c2.ch) / sa
```

Q (the texture-coord IEEE float in c2's upper 32 bits) is **not**
interpolated — it passes through from the closing vertex's RGBAQ
unchanged.

For a flat-shaded primitive (RGBAQ written once before all three
vertices, all three vertex colors equal), the weights sum to `sa`
(equivalently, the normalized weights `λ_i = L_i / sa` sum to 1), so
the formula collapses to `c0` exactly with no rounding error —
existing flat-shaded raster TBs (raster_basic, raster_topleft)
continue to pass.

The R/G/B/A division uses **integer truncation toward zero**.
Real PS2 GS uses fixed-point with specific rounding rules; the
recognition-layer stub is intentionally simpler. SPRITE keeps
flat shading (only 2 vertices, no barycentric weights defined).

#### Sprite rectangle fill

A SPRITE has two vertices forming opposite corners. The bounding
box is computed via `min`/`max` of each axis; every pixel inside
the box is emitted in row-major order.

#### FSM and scan timing

The FSM is `IDLE` → `SCAN`. On `prim_complete_now` for an eligible
primitive, the FSM latches the vertex tuple, color, FRAME_1
fields, and bounding box, then walks the box one pixel per cycle.
For each pixel: combinational inside-test → if inside, pulse
`raster_pixel_emit` and update the snapshot. Returns to `IDLE`
when `(ras_cur_x, ras_cur_y) == (x_max, y_max)`.

Color is **Gouraud-interpolated per pixel** for triangles
(Ch86) and **flat** for sprites — see the dedicated subsections
above for the fill-rule and Gouraud math. The closing-primitive
flat color (`prim_color_q`) is still used as the SPRITE fill
color and as a backward-compat reference for flat-shaded TRIs
(when all three vertex colors are equal, the Gouraud formula
reduces to that flat value with no rounding error).

Coordinates are **integer** — the 4-bit sub-pixel of 12.4
fixed-point is discarded. Sub-pixel edge adjustment is not
modeled (top-left fill rule IS modeled — see Ch85 subsection
above).

#### Raster command queue (Ch87) and `raster_overflow`

`gs_stub` has a **2-entry FIFO** in front of the SCAN FSM. Every
primitive close that targets the rasterizer (`RM_TRI` /
`RM_SPRITE`) snapshots its full per-prim context (vertices,
bias, signed area, per-vertex colors, FRAME_1 fields, bounding
box) into the queue at the close cycle. The FSM dequeues the
oldest entry whenever it's idle or finishing a scan. Effective
concurrency is **1 in-flight + 2 queued = up to 3 back-to-back
closes** absorbed without drop.

`raster_overflow` now latches when a 4th close arrives while the
FIFO is **full** (1 in-flight, both FIFO slots occupied). The
4th primitive is dropped. Earlier chapters' bound of "1 close
mid-scan = overflow" is replaced by Ch87's "3 closes
back-to-back = OK; 4th = overflow."

Degenerate triangles are **filtered at enqueue**: they set
`raster_degenerate` and are not pushed into the queue. SPRITE
never sets `raster_degenerate`. POINT/LINE/LINE_STRIP don't
raster (RM_NONE) — they don't enqueue at all and the queue
ignores them.

Pop happens at `IDLE`→`SCAN` AND at drain-done (Ch88; see below)
when the queue has more work, so back-to-back scans run
contiguously without an `IDLE` bubble. `raster_active` stays
high across the boundary.

Real PS2 game streams emit thousands of primitives back-to-back;
3-deep concurrency is enough for most TRI_STRIP / TRI_FAN
patterns with small bounding boxes. Larger sprites or larger
triangles increase scan length and reduce headroom — a future
chapter can grow the FIFO depth.

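
The back-to-back-close budget can be illustrated with an event-level model. This Python sketch is deliberately not cycle-accurate: the class `RasterQueue` and its methods are hypothetical, a pop is assumed to happen immediately whenever the scanner is idle, and the "concurrent pop frees a slot" corner case and degenerate filtering are not modeled.

```python
from collections import deque


class RasterQueue:
    """Hypothetical event-level model: 1 in-flight scan + 2-entry FIFO."""

    DEPTH = 2

    def __init__(self):
        self.in_flight = None
        self.fifo = deque()
        self.overflow = False  # sticky, like raster_overflow

    def _try_pop(self):
        # The scanner takes the oldest entry whenever it is idle.
        if self.in_flight is None and self.fifo:
            self.in_flight = self.fifo.popleft()

    def close(self, prim):
        """A raster-eligible primitive closes. Returns False if dropped."""
        if len(self.fifo) == self.DEPTH:
            self.overflow = True
            return False
        self.fifo.append(prim)
        self._try_pop()
        return True

    def scan_done(self):
        """The in-flight scan finishes; start the next one if queued."""
        self.in_flight = None
        self._try_pop()
```

Four closes with no intervening `scan_done()` reproduce the "3 closes back-to-back = OK; 4th = overflow" bound from this section.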
#### Pixel pipeline (Ch88)

The SCAN body is **3 stages, throughput 1 candidate pixel per
cycle**:

| Stage | Source | Work |
|-------|--------|------|
| **S0** | `ras_cur_x` / `ras_cur_y` (bbox walker) | Generate the next candidate coord; advance the bbox walker; on bbox corner, fire `ras_at_end_of_s0` and transition R_SCAN→R_DRAIN. |
| **S1** | `s1_x_q` / `s1_y_q` (registered) | Combinational edge functions on `(s1_x, s1_y)` against the three triangle edges (or trivial-true for SPRITE), top-left bias, inside test → `s1_pixel_inside`. Latched into `s2_inside_q`. |
| **S2** | `s2_x_q` / `s2_y_q` / `s2_L0..L2_q` / `s2_inside_q` | Compute Gouraud `interp_byte(λ_i, c_i)` × 4 channels and `s2_fb_addr` from PSM/FBP/FBW. If `s2_valid_q && s2_inside_q`, drive `raster_pixel_emit` with the resolved fb_addr / x / y / color. |

`raster_state` is now a 3-state FSM:

- **R_IDLE** — no work; `pop_ok` fires on a non-empty FIFO.
- **R_SCAN** — S0 produces one valid coord per cycle; S1/S2
  latches propagate. On bbox corner, transitions to R_DRAIN.
- **R_DRAIN** — S0 stops producing valids (`s1_valid_q <= 0`);
  S1 and S2 finish their in-flight pixels. When both pipeline
  valids are low (`drain_done`), the FSM either pops the next
  primitive (back-to-back contiguous SCAN) or returns to R_IDLE.

`pop_ok = !fifo_empty && (R_IDLE || drain_done)` — the
end-of-scan pop is now drain-done, three cycles after S0
produces the corner. This guarantees the pipeline-tail pixels
of the previous primitive are not overwritten by the next
primitive's pop, while still keeping `raster_active` high
across the seam.

Latency from `pop_ok` to first registered `raster_pixel_emit`
is **3 stages of pipeline + 1 cycle of FIFO turnaround + 1
cycle of registered emit output = 5 posedges from the close
cycle of the closing vertex** (see
`sim/tb/gif_gs/tb_gs_raster_pipeline.sv` for the cycle-exact
contract).

The event list under "Event taxonomy" above has two further entries:

- `EV_MODE` — fired for any accept that did not resolve to a tracked
  register: REGLIST entries, IMAGE/DISABLE payload qwords, NOP-nibble
  PACKED slots, unknown privileged offsets, unknown GIF reg numbers.
  Reserved for "we know we saw something, we are intentionally not
  modeling it yet."

- `EV_GIFTAG` — one per accepted GIFtag; carries `flg`/`nreg`/`nloop`/`eop`
  for stream-level checking.

When trace event semantics change, audit this section and the per-stub
trace-schema header comments together.

#### VRAM persistence (Ch89)

`vram_stub` (`rtl/gif_gs/vram_stub.sv`) is the **first persistence
layer** the rasterizer has had. Every `raster_pixel_emit` pulse
writes up to 4 bytes of pixel data (gated per byte by `write_be`)
at `raster_pixel_fb_addr_q` into `vram_stub`'s linear byte array.
A combinational debug read port exposes `read_data` byte-addressably
so testbenches can verify storage.

Wiring:

| vram_stub port | gs_stub source |
|---|---|
| `write_en` | `raster_pixel_emit` |
| `write_addr` | `raster_pixel_fb_addr_q` |
| `write_data` | `raster_pixel_color_q[31:0]` (the lower 32 bits — Q in the upper 32 is not framebuffer data) |
| `write_be` | `raster_pixel_be_q` (Ch95) — per-byte write enable: byte i (the byte at `write_addr + i`) is committed only when `write_en && write_be[i]`. Lets the same 32-bit write port serve PSMs of any byte width. |
| `write_mask` | `raster_pixel_mask_q` (Ch106) — per-bit merge mask: for each enabled byte, `mem[i] <= (mem[i] & ~mask_i) \| (data_i & mask_i)`. Tied to `0xFFFFFFFF` for PSMs ≥ 1 byte/pixel (no behavior change). PSMT4 drives `0x0000_000F` or `0x0000_00F0` to preserve the un-targeted nibble in the same byte. |

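
The byte-enable and bit-mask semantics in the wiring table can be mirrored by a short software model, which is handy for computing expected VRAM contents in a testbench. This Python sketch is hypothetical (`vram_write` and `psmt4_write` are illustrative names, not RTL ports); mapping even pixel indices to the low nibble is an assumption, since the text only says the nibble is keyed by `pixel_index[0]`.

```python
def vram_write(mem, addr, data, be, mask=0xFFFF_FFFF):
    """Commit one 32-bit write with per-byte enable and per-bit merge mask."""
    for i in range(4):
        if (be >> i) & 1:
            d = (data >> (8 * i)) & 0xFF   # data lane i
            m = (mask >> (8 * i)) & 0xFF   # mask lane i
            mem[addr + i] = (mem[addr + i] & ~m & 0xFF) | (d & m)


def psmt4_write(mem, pixel_index, index4):
    """Write one 4-bit PSMT4 index, preserving the other nibble."""
    high = pixel_index & 1                 # assumed: odd pixel -> high nibble
    vram_write(
        mem,
        pixel_index >> 1,                  # two pixels per byte
        (index4 & 0xF) << (4 * high),
        0b0001,
        0xF0 if high else 0x0F,
    )


# Two PSMT4 pixels landing in the same byte end up merged.
mem = bytearray(8)
psmt4_write(mem, 0, 0xA)
psmt4_write(mem, 1, 0x5)
```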

Scope (current write-side support, after Ch106):

- **PSMCT32 + PSMCT16 + PSMT8 + PSMT4** at the raster write port. The PSM
  width is selected by `gs_stub`'s `bpp_shift` mux off
  `FRAME_1.PSM` and surfaced as `raster_pixel_psm_q`; `gs_stub`'s
  S2 packs the pixel into the right byte lane and drives
  `raster_pixel_be_q` so `vram_stub` commits exactly the right
  bytes:
  - PSMCT32 (PSM=0x00) → 4 bytes/pixel, `be = 4'b1111`, ABGR in
    `write_data[31:0]`.
  - PSMCT16 (PSM=0x02) → 2 bytes/pixel, `be = 4'b0011`, RGB5A1
    packed in `write_data[15:0]` (Ch95). `write_addr` is the
    halfword byte address — per-byte `be` makes unaligned
    halfword writes safe.
  - PSMT8 (PSM=0x13) → 1 byte/pixel, `be = 4'b0001`; the R channel
    of the natural ABGR color goes into `write_data[7:0]` as the PSMT8
    index (Ch105). `write_addr` is the exact byte address;
    `vram_stub` commits `mem[write_addr] ← write_data[7:0]` at
    any byte alignment without needing data-lane shifting.
  - PSMT4 (PSM=0x14) → 0.5 bytes/pixel (2 pixels per byte),
    `be = 4'b0001`, `write_mask = 0x0000_000F` (low nibble) or
    `0x0000_00F0` (high nibble) per `pixel_index[0]`. The 4-bit
    index (low nibble of the natural ABGR color's R channel) is placed
    in the targeted nibble position in `write_data[7:0]`. `vram_stub`
    merges only the masked bits — the OTHER nibble of the same
    byte is preserved (Ch106). Back-to-back same-byte emits
    (e.g. PSMT4 pixels x=0 and x=1, both landing in byte 0)
    chain through NBA semantics: the second NBA samples
    `mem[addr]` AFTER the prior commit, so both nibbles end up in
    the byte without a bypass-forwarding net.
  - PSMCT24 / PSMCT16S / PSMZ32 / PSMZ24 / PSMZ16 / PSMZ16S /
    PSMT8H / PSMT4HL / PSMT4HH — `bpp_shift` falls through to a
    host-word default (4 bytes); raster emit through these PSMs
    is not contract-tested.
- **Write-side addressing**. Real PS2 VRAM is 4 MiB organized
  into pages × blocks × columns per PSM. By DEFAULT, both
  `gs_stub` raster emit and `gif_image_xfer_stub` TRXDIR uploads
  produce the linear-framebuffer layout PCSX2 calls "linear PSM".
  Optional per-PSM swizzle paths gated by parameters on each
  module:
  * **PSMCT32**: `PSMCT32_SWIZZLE` parameter on `gs_pcrtc_stub`
    (Ch120 read-side), `gif_image_xfer_stub` (Ch121 image-xfer
    write-side), and `gs_stub` (Ch122 raster write-side).
  * **PSMCT16**: `PSMCT16_SWIZZLE` parameter on `gs_pcrtc_stub`
    (Ch126 read-side), `gif_image_xfer_stub` (Ch127 image-xfer
    write-side), and `gs_stub` (Ch128 raster write-side). All
|
||
three integration points live, mirroring the PSMCT32 trio.
|
||
When on, byte addresses route through the per-PSM swizzle module
|
||
(`gs_swizzle_psmct32_stub` / `gs_swizzle_psmct16_stub`); image-xfer
|
||
adds `dest_base_q = DBP*256` on top of the swizzle output so any
|
||
DBP works, while raster emit feeds the active `ras_fbp` directly
|
||
so the swizzle output is already the absolute address. Per-PSM
|
||
parameters are independent — enabling one doesn't affect the
|
||
other PSM. **PSMT8** has its full three-path swizzle integration
|
||
as of Ch134, mirroring the PSMCT32/PSMCT16 trios: standalone
|
||
math primitive `gs_swizzle_psmt8_stub` (Ch131) wired into
|
||
`gs_pcrtc_stub` (Ch132 read-side, `PSMT8_SWIZZLE`),
|
||
`gif_image_xfer_stub` (Ch133 write-side), and `gs_stub` (Ch134
|
||
raster emit) — same parameter name on all three modules.
|
||
**PSMT4** has its full three-path swizzle integration as of
|
||
Ch140, mirroring the PSMCT32/PSMCT16/PSMT8 trios: standalone
|
||
math primitive `gs_swizzle_psmt4_stub` (Ch137) wired into
|
||
`gs_pcrtc_stub` (Ch138 read-side, `PSMT4_SWIZZLE`),
|
||
`gif_image_xfer_stub` (Ch139 write-side), and `gs_stub` (Ch140
|
||
raster emit) — same parameter name on all three modules. The
|
||
PSMT4 paths additionally thread the swizzle module's
|
||
`nibble_hi` output through the existing Ch106 (raster) /
|
||
Ch118 (image-xfer) nibble RMW machinery (replacing
|
||
`s2_pixel_index[0]` / `x_eff[0]` as the high/low nibble
|
||
selector when the gate is on). All parameter defaults are 0,
|
||
so existing TBs see the legacy linear behavior. **All four
common GS PSMs (CT32 + CT16 + T8 + T4) now have complete
three-path swizzle integration (scanout read, image-xfer write,
raster emit).**
|
||
- **Stub-sized**. Default `BYTES = 65536`. Real VRAM is 4 MiB; for
|
||
TB purposes a small linear region is enough to verify that
|
||
emitted pixels actually land at the addresses gs_stub computes.
|
||
- **Scanout path** is provided by `gs_pcrtc_stub` (Ch90 — see
|
||
below). The legacy `platform_video_stub` flood-fills BGCOLOR
|
||
and is unaware of VRAM; TBs that want to verify the round trip
|
||
use `gs_pcrtc_stub` instead.
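
The byte-enable and merge-mask behaviour of the write port described above can be sketched as a small Python reference model. This is only an illustration under stated assumptions: the function name `vram_commit` and the `bytearray` backing store are hypothetical, lane i of `write_data` is assumed to be bits `[8*i+7 : 8*i]`, and writes past the end of the array are assumed to be dropped.

```python
def vram_commit(mem, write_addr, write_data, write_be=0b1111, write_mask=0xFFFF_FFFF):
    """Hypothetical software model of one vram_stub write (write_en assumed high).

    Byte i (at write_addr + i) is committed only when write_be[i] is set;
    within an enabled byte, only the bits set in write_mask are replaced.
    """
    for i in range(4):
        if not (write_be >> i) & 1:
            continue                      # lane disabled: byte is untouched
        addr = write_addr + i
        if addr >= len(mem):              # assumption: out-of-range bytes are dropped
            continue
        data_i = (write_data >> (8 * i)) & 0xFF
        mask_i = (write_mask >> (8 * i)) & 0xFF
        mem[addr] = (mem[addr] & ~mask_i & 0xFF) | (data_i & mask_i)


mem = bytearray(65536)                           # default BYTES = 65536
vram_commit(mem, 0x10, 0x11223344)               # PSMCT32: be = 4'b1111
vram_commit(mem, 0x21, 0xBEEF, write_be=0b0011)  # PSMCT16 at an odd byte address
vram_commit(mem, 0x33, 0x55, write_be=0b0001)    # PSMT8: exactly one byte
```

Because each lane is gated independently, the halfword and single-byte writes in the example leave their neighbouring bytes at zero.
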
|
||
|
||
The Ch89 white-box TB `tb_gs_vram_writeback.sv` exercises the
|
||
contract end-to-end: drive a 4×4 SPRITE through gs_stub, capture
|
||
the (fb_addr, color) of each `raster_pixel_emit` pulse, then
|
||
read each fb_addr back from `vram_stub` and assert byte-exact
|
||
match.
|
||
|
||
#### PCRTC scanout (Ch90)
|
||
|
||
`gs_pcrtc_stub` (`rtl/gif_gs/gs_pcrtc_stub.sv`) is the **scanout
|
||
side** of the GS pipeline — its dual is `gs_stub` (the write
|
||
side). It models a minimal PCRTC (Programmable CRT Controller):
|
||
runs its own raster timing, generates a VRAM read address from
|
||
the current `(hcnt, vcnt)` using the same fb_addr math as
|
||
gs_stub, reads the data returned by `vram_stub`'s combinational
|
||
debug port, and drives `r`/`g`/`b` for the active area. Together
|
||
with Ch88's pipeline + Ch89's VRAM, this closes the loop:

```
raster_pixel_emit → vram_stub.write → vram_stub.read → pcrtc.r/g/b
```

Configuration (Ch91 — privileged-block CPU MMIO):
|
||
|
||
`gs_pcrtc_stub` consumes three real PS2 GS privileged display
|
||
register latches directly from `gs_stub`:

| pcrtc input | gs_stub source | Layout |
|---|---|---|
| `pmode_q[63:0]` | privileged write at offset 0x0000 | bit 0 = EN1 (display 1 enable) |
| `dispfb1_q[63:0]` | privileged write at offset 0x0070 | FBP[8:0], FBW[14:9], PSM[19:15], DBX[42:32] (Ch91-audit), DBY[53:43] (Ch91-audit) |
| `display1_q[63:0]` (Ch92, Ch93) | privileged write at offset 0x0080 | DX[11:0], DY[22:12], MAGH[26:23] (Ch93; H scale = MAGH+1), MAGV[28:27] (Ch93; V scale = MAGV+1), DW[43:32] (width-1), DH[54:44] (height-1) |

The Ch90 sideband ports (`scanout_enable` / `dispfb_fbp` /
|
||
`dispfb_fbw`) are **gone**. TBs program scanout the way a real
|
||
PS2 driver would: write DISPFB1, then DISPLAY1, then PMODE.EN1=1
|
||
(Ch92). Out of reset, all three registers are 0, so EN1 is low
|
||
and pcrtc outputs 0.
|
||
|
||
`scanout_enable` inside pcrtc is derived combinationally from
|
||
the latches:
|
||
`scanout_enable = pmode_q[0] & (PSM ∈ {0, 2, 0x13, 0x14})`.
|
||
PSMCT32 (=0), PSMCT16 (=2), PSMT8 (=0x13), and PSMT4 (=0x14) are
|
||
honored at this scope; any other PSM forces scanout off rather
|
||
than mis-decoding the byte layout.
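
The register layouts and the `scanout_enable` derivation above can be sketched in Python as follows. The helper names (`bits`, `decode_dispfb1`, `decode_display1`, `scanout_enable`) are hypothetical and exist only for this illustration; the bit ranges are taken from the layout table in this section.

```python
PSM_SCANOUT = {0x00, 0x02, 0x13, 0x14}   # PSMCT32, PSMCT16, PSMT8, PSMT4


def bits(value, hi, lo):
    """Extract value[hi:lo] (inclusive, Verilog-style slice)."""
    return (value >> lo) & ((1 << (hi - lo + 1)) - 1)


def decode_dispfb1(reg):
    return {"FBP": bits(reg, 8, 0), "FBW": bits(reg, 14, 9), "PSM": bits(reg, 19, 15),
            "DBX": bits(reg, 42, 32), "DBY": bits(reg, 53, 43)}


def decode_display1(reg):
    return {"DX": bits(reg, 11, 0), "DY": bits(reg, 22, 12),
            "MAGH": bits(reg, 26, 23), "MAGV": bits(reg, 28, 27),
            "DW": bits(reg, 43, 32), "DH": bits(reg, 54, 44)}


def scanout_enable(pmode, dispfb1):
    """EN1 set and a PSM that pcrtc knows how to decode."""
    return bool(pmode & 1) and decode_dispfb1(dispfb1)["PSM"] in PSM_SCANOUT


# Example values: FBW=1, PSMCT16, DBX=4, DBY=2; 8x8 window at (4, 2), 2x/2x magnification.
dispfb1 = (1 << 9) | (0x02 << 15) | (4 << 32) | (2 << 43)
display1 = 4 | (2 << 12) | (1 << 23) | (1 << 27) | (7 << 32) | (7 << 44)
```

With all three registers at their reset value of 0, `scanout_enable(0, 0)` is false, which matches the out-of-reset behaviour described above.
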
|
||
|
||
DISPLAY1 (Ch92, Ch93) supplies the **display window** — the
|
||
sub-rect inside the active area where pcrtc actually pulls
|
||
pixels from VRAM — and the **per-axis magnification**: each
|
||
VRAM column is shown for (MAGH+1) consecutive VCK pulses, each
|
||
VRAM line for (MAGV+1) raster lines. Outside the window pcrtc
|
||
drives r/g/b = 0 even with EN1=1. Pcrtc's H_TOTAL/V_TOTAL still
|
||
come from module parameters at instantiation; only the
|
||
active-area sub-rect gated by DX/DY/DW/DH is register-driven.
|
||
Dual-display (PMODE.EN2 + DISPFB2 + DISPLAY2) is deferred.
|
||
|
||
Address math + display-window gating + magnification:
|
||
|
||
```
hmag_factor    = MAGH + 1                          // 1..16
vmag_factor    = MAGV + 1                          // 1..4
hwin_rel       = hcnt - DX                         // pixel offset inside the window
vwin_rel       = vcnt - DY
in_window      = (hcnt >= DX) && (hwin_rel <= DW)
              && (vcnt >= DY) && (vwin_rel <= DH)
fbp_bytes      = dispfb_fbp << 11                  // FBP × 2048
pixels_per_row = dispfb_fbw << 6                   // FBW × 64
vram_x_unshift = hwin_rel / hmag_factor            // 4 displayed pixels = 1 VRAM column at MAGH=3
vram_y_unshift = vwin_rel / vmag_factor
effective_x    = vram_x_unshift + DBX
effective_y    = vram_y_unshift + DBY
pixel_index    = effective_y × pixels_per_row + effective_x
bpp_shift      = (PSM == PSMCT32) ? 2 :
                 (PSM == PSMCT16) ? 1 :
                 (PSM == PSMT8)   ? 0 : 2
fb_addr        = fbp_bytes + (pixel_index << bpp_shift)   // PSMT4: byte offset is pixel_index >> 1 instead
r/g/b drive    = (de && scanout_enable && in_window) ? decode(VRAM, PSM) : 0
```
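
As a cross-check of the address math, here is a minimal Python sketch of the same computation. The function name `scanout_addr` and its argument order are hypothetical. It returns `None` for positions outside the display window, and it assumes the FBP base is added for PSMT4 in the same way as for the other PSMs.

```python
PSMCT32, PSMCT16, PSMT8, PSMT4 = 0x00, 0x02, 0x13, 0x14


def scanout_addr(hcnt, vcnt, fbp, fbw, psm, dx, dy, dw, dh,
                 magh=0, magv=0, dbx=0, dby=0):
    """VRAM byte address for raster position (hcnt, vcnt), or None outside the window."""
    hwin_rel, vwin_rel = hcnt - dx, vcnt - dy
    if not (0 <= hwin_rel <= dw and 0 <= vwin_rel <= dh):
        return None                                   # pcrtc drives r/g/b = 0 here
    effective_x = hwin_rel // (magh + 1) + dbx        # undo H magnification, apply DBX
    effective_y = vwin_rel // (magv + 1) + dby        # undo V magnification, apply DBY
    pixel_index = effective_y * (fbw << 6) + effective_x
    fbp_bytes = fbp << 11
    if psm == PSMT4:
        return fbp_bytes + (pixel_index >> 1)         # 2 pixels per byte (assumed FBP base)
    bpp_shift = {PSMCT32: 2, PSMCT16: 1, PSMT8: 0}.get(psm, 2)
    return fbp_bytes + (pixel_index << bpp_shift)


# 2x/2x magnification, window at (4, 2), DW = DH = 7, FBW = 1, PSMCT32:
# displayed columns 4 and 5 both read VRAM column 0; column 6 reads VRAM column 1.
a0 = scanout_addr(4, 2, 0, 1, PSMCT32, 4, 2, 7, 7, magh=1, magv=1)
a1 = scanout_addr(5, 2, 0, 1, PSMCT32, 4, 2, 7, 7, magh=1, magv=1)
a2 = scanout_addr(6, 2, 0, 1, PSMCT32, 4, 2, 7, 7, magh=1, magv=1)
```
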
|
||
|
||
Per-PSM color decode at `vram_read_data`:
|
||
|
||
- **PSMCT32**: `r = data[7:0]`, `g = data[15:8]`, `b = data[23:16]`. Alpha at `[31:24]` is dropped (no DAC channel).
|
||
- **PSMCT16** (Ch94): RGB5A1 packed into the lower 16 bits as `{A[15], B[14:10], G[9:5], R[4:0]}`. 5→8 expansion uses bit-replicate `r8 = {r5, r5[4:2]}` (so 5'h1F → 8'hFF, 5'h00 → 8'h00). Alpha bit dropped at the DAC.
|
||
- **PSMT8** (Ch96/Ch97): index in `data[7:0]`. With `clut_enable=1` (Ch97), pcrtc presents `clut_read_idx = idx + (CSA << 4)` to the external `clut_stub` and decodes the returned PSMCT32 entry as `r = clut_data[7:0]`, `g = clut_data[15:8]`, `b = clut_data[23:16]`. With `clut_enable=0` (Ch96 fallback), pcrtc surfaces the index as grayscale so the 8-bit storage lane is visually verifiable without programming a CLUT.
|
||
- **PSMT4** (Ch103): 2 pixels per byte. `byte_offset = pixel_index >> 1` (overrides the standard `pixel_index << bpp_shift` math). `nibble = pixel_index[0] ? data[7:4] : data[3:0]` picks the 4-bit pixel; the zero-extended 8-bit value `{4'd0, nibble}` plus `(CSA << 4)` is presented on `clut_read_idx`. With `clut_enable=1`, pcrtc decodes the returned PSMCT32 entry the same way as PSMT8. With `clut_enable=0`, the fallback replicates the nibble across the 8-bit DAC value (`r = g = b = {nibble, nibble}`) so 4'hF → 0xFF and 4'h5 → 0x55. CSA is the natural per-palette-window selector for PSMT4 — multiple 16-entry palettes can share the 256-entry staging area, indexed by CSA.
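
The per-PSM decode rules above can be sketched as a few Python helpers. All function names are hypothetical. The CLUT is modelled as a 256-entry list of PSMCT32 words, and passing `clut=None` stands in for `clut_enable=0`. The mod-256 wrap of `clut_read_idx` is taken from the CSA=1 wrap phase of `tb_gs_scanout_psmt8_clut.sv` (idx 0xF8 → CLUT[0x08]).

```python
def expand5(c5):
    """5-bit to 8-bit expansion by bit replication: {c5, c5[4:2]}."""
    return (c5 << 3) | (c5 >> 2)


def decode_psmct32(data):
    """Return (r, g, b); alpha at [31:24] is dropped."""
    return (data & 0xFF, (data >> 8) & 0xFF, (data >> 16) & 0xFF)


def decode_psmct16(data):
    """RGB5A1 in the low 16 bits: {A[15], B[14:10], G[9:5], R[4:0]}."""
    return (expand5(data & 0x1F),
            expand5((data >> 5) & 0x1F),
            expand5((data >> 10) & 0x1F))


def clut_index(idx, csa):
    """Index presented on clut_read_idx; wraps inside the 256-entry CLUT."""
    return (idx + (csa << 4)) & 0xFF


def psmt4_nibble(byte, pixel_index):
    """Odd pixels live in the high nibble, even pixels in the low nibble."""
    return (byte >> 4) & 0xF if pixel_index & 1 else byte & 0xF


def decode_indexed(idx, csa, clut=None, psmt4=False):
    """PSMT8 / PSMT4 decode; clut=None models the clut_enable=0 fallback."""
    if clut is not None:
        return decode_psmct32(clut[clut_index(idx, csa)])
    gray = ((idx << 4) | idx) if psmt4 else idx   # PSMT4 fallback: {nibble, nibble}
    return (gray, gray, gray)
```
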
|
||
|
||
**Ch95 — gs_stub raster channel emits PSMCT16**. The S2 stage
|
||
of the pipeline now packs ABGR → RGB5A1 (`r5=R[7:3]`, `g5=G[7:3]`,
|
||
`b5=B[7:3]`, `a1=A[7]`) when `ras_bpp_shift==1` (PSMCT16 / PSMCT16S
|
||
/ PSMZ16 / PSMZ16S — any 16-bit PSM). The packed 16-bit pixel
|
||
goes in the LOW halfword of `raster_pixel_color_q[31:0]`, and a
|
||
new `raster_pixel_be_q[3:0]` selects which bytes vram_stub
|
||
commits: `4'b0011` for PSMCT16, `4'b1111` for PSMCT32. vram_stub
|
||
gates each byte write on `write_be[i]`, so back-to-back PSMCT16
|
||
emits write 2 bytes each without halfword stomping. New
|
||
`raster_pixel_psm_q[5:0]` exposes the current PSM for trace.
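
The ABGR → RGB5A1 packing described above can be illustrated with a short Python sketch. The function names are hypothetical, and the input is assumed to be a 32-bit ABGR word with R in bits [7:0] and A in bits [31:24], as in the PSMCT32 decode elsewhere in this section.

```python
def pack_rgb5a1(abgr):
    """Pack a 32-bit ABGR word into {A[15], B[14:10], G[9:5], R[4:0]}."""
    r5 = (abgr >> 3) & 0x1F     # R[7:3]
    g5 = (abgr >> 11) & 0x1F    # G[7:3]
    b5 = (abgr >> 19) & 0x1F    # B[7:3]
    a1 = (abgr >> 31) & 0x1     # A[7]
    return (a1 << 15) | (b5 << 10) | (g5 << 5) | r5


def psmct16_emit(abgr):
    """Return (write_data, write_be) for one PSMCT16 emit: low halfword, be = 4'b0011."""
    return pack_rgb5a1(abgr), 0b0011
```

For example, R=0x55, G=0xAA, B=0xBB, A=0xCC packs to `0xDEAA`: r5=0x0A, g5=0x15, b5=0x17, a1=1.
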
|
||
|
||
The Ch95 TB `tb_gs_raster_psmct16.sv` exercises the round trip:
|
||
gs_stub renders a 4×4 SPRITE with FRAME_1.PSM=PSMCT16, then VRAM
|
||
read-back verifies each pixel landed at the right halfword AND
|
||
that the halfword right after the sprite stays zero (no leak).
|
||
|
||
Ch105 extends the raster channel to PSMT8 (FRAME_1.PSM=0x13).
|
||
When `ras_bpp_shift==0`, S2 takes the natural ABGR's R channel
|
||
(low 8 bits) as the PSMT8 index — the same lane real PS2 hardware
|
||
writes when the destination FB is PSMT8 — places it in the LOW
|
||
byte of the emit lane, and sets `raster_pixel_be_q = 4'b0001` so
|
||
vram_stub commits exactly the 1 byte at fb_addr. The 1-byte
|
||
commit works at any byte alignment because vram_stub gates each
|
||
byte lane independently. The Ch105 TB `tb_gs_raster_psmt8.sv`
|
||
renders a 5×3 SPRITE (chosen so the row spans byte lanes 1, 2, 3,
|
||
0, 1 — exercising every lane alignment) at FRAME_1.PSM=PSMT8 with
|
||
RGBAQ R=0x55, G=0xAA, B=0xBB, A=0xCC; asserts each sprite byte
|
||
reads back as 0x55, the bytes immediately left and right of the
|
||
sprite stay 0x00 (so 32-bit-aligned overwrite would be visible),
|
||
and a full-VRAM sweep finds NO byte equal to 0xAA / 0xBB / 0xCC
|
||
(channel-isolation: only R reaches VRAM at PSMT8).
|
||
|
||
Ch106 closes the indexed-write gap with PSMT4 (FRAME_1.PSM=0x14)
|
||
as a per-bit RMW into `vram_stub`. Three changes form the
|
||
mechanism:
|
||
|
||
1. `vram_stub` gains a new `write_mask[31:0]` input (Ch106). The
|
||
commit is now `mem[i] <= (mem[i] & ~mask_i) | (data_i & mask_i)`
|
||
for each enabled byte. PSMCT32/16/PSMT8 tie mask=`0xFFFF_FFFF`
|
||
(no behavior change — full byte writes).
|
||
2. `gs_stub`'s S2 PSM-aware emit packing gets a PSMT4 branch:
|
||
the byte address is `pixel_index >> 1` (overrides the
|
||
`pixel_index << ras_bpp_shift` form), the index is the low
|
||
4 bits of the natural ABGR's R channel, and the emit places
|
||
that 4-bit value in either the low (`{4'd0, idx}`) or high
|
||
(`{idx, 4'd0}`) nibble of `write_data[7:0]` based on
|
||
`pixel_index[0]`. `s2_emit_be = 4'b0001`,
|
||
`s2_emit_mask = pixel_index[0] ? 0x0000_00F0 : 0x0000_000F`.
|
||
3. New `raster_pixel_mask_q[31:0]` output on `gs_stub` carries
|
||
the mask through to `vram_stub.write_mask`.
|
||
|
||
The Ch106 TB `tb_gs_raster_psmt4.sv` is intentionally
|
||
adversarial about preservation. VRAM is preloaded with `0xA5`
|
||
(high=A, low=5) at every byte the sprites will touch. Three
|
||
phases:
|
||
|
||
- **Phase A**: 4×2 SPRITE at (0,0)..(3,1), R=0x05 → idx=5. Both
|
||
nibbles of each enclosing byte are written (8 emits across 4
|
||
bytes); each byte ends at `0x55` and the four neighbouring
|
||
preloaded bytes (2..3, 34..35) remain `0xA5`. This proves the
|
||
back-to-back same-byte case (NBA chaining) and the neighbour-
|
||
byte preservation in one go.
|
||
- **Phase B**: single-pixel SPRITE at (5, 2). x=5 odd → high
|
||
nibble; pixel_index = 133, byte_addr = 66; idx=7. Preload
|
||
`mem[66] = 0xA5`. Expected after raster: `mem[66] = 0x75` —
|
||
high nibble updated from A to 7, low nibble stays 5. Proves
|
||
isolated high-nibble RMW preserves the low nibble.
|
||
- **Phase C**: single-pixel SPRITE at (4, 3). x=4 even → low
|
||
nibble; pixel_index = 196, byte_addr = 98; idx=9. Preload
|
||
`mem[98] = 0xA5`. Expected after: `mem[98] = 0xA9` — low
|
||
nibble updated from 5 to 9, high nibble stays A. Proves
|
||
isolated low-nibble RMW preserves the high nibble.
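
The expected bytes in the three phases above can be reproduced with a small Python model of the PSMT4 emit and nibble merge. This is a sketch under stated assumptions: `psmt4_emit` and `raster` are hypothetical names, FBP is assumed to be 0 and FBW to be 1 (64 pixels per row, consistent with the pixel_index values quoted above), and the whole model array is preloaded with `0xA5` rather than only the bytes the sprites touch.

```python
def psmt4_emit(x, y, abgr, fbw=1):
    """Return (byte_addr, write_data, write_be, write_mask) for one PSMT4 pixel."""
    pixel_index = y * (fbw << 6) + x
    idx = abgr & 0xF                                  # low 4 bits of the R channel
    if pixel_index & 1:                               # odd pixel: high nibble
        return pixel_index >> 1, idx << 4, 0b0001, 0x0000_00F0
    return pixel_index >> 1, idx, 0b0001, 0x0000_000F  # even pixel: low nibble


def raster(mem, x0, y0, x1, y1, abgr):
    """Emit every pixel of an inclusive rectangle and merge it into mem."""
    emits = 0
    for y in range(y0, y1 + 1):
        for x in range(x0, x1 + 1):
            addr, data, _be, mask = psmt4_emit(x, y, abgr)
            # Per-bit merge: only the masked nibble changes.
            mem[addr] = (mem[addr] & ~mask & 0xFF) | (data & mask)
            emits += 1
    return emits


mem = bytearray([0xA5] * 256)             # preload (model-wide here, see lead-in)
emits = raster(mem, 0, 0, 3, 1, 0x05)     # Phase A: 4x2 sprite, idx = 5
emits += raster(mem, 5, 2, 5, 2, 0x07)    # Phase B: high nibble of byte 66
emits += raster(mem, 4, 3, 4, 3, 0x09)    # Phase C: low nibble of byte 98
```

Because the model applies emits one at a time, the second write to a shared byte always sees the first, which is the software analogue of the NBA chaining described above.
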
|
||
|
||
Continuous observer asserts `psm_q == 6'h14`, `be_q == 4'b0001`,
|
||
and `mask_q ∈ {0x0F, 0xF0}` on every emit. Final aggregate
|
||
checks: 10 emits total, full-VRAM sweep finds NO byte equal to
|
||
0xAA / 0xBB / 0xCC (only R reaches the framebuffer at PSMT4).
|
||
|
||
DBX / DBY shift the VRAM origin: the pixel that appears at
|
||
displayed (DX, DY) corresponds to VRAM (DBX, DBY). Real PS2
|
||
drivers use this for double-buffered framebuffers (alternate
|
||
frames at different DBX/DBY) and offset display windows.
|
||
|
||
The following TBs lock these contracts:
|
||
|
||
- `tb_gs_scanout_basic.sv` — DBX=DBY=0, DISPLAY1 covers full
|
||
active area, MAGH=MAGV=0 (1×): classic sprite-at-origin scanout.
|
||
- `tb_gs_scanout_dbx_dby.sv` — sprite at VRAM (4,2)..(7,5),
|
||
DISPFB1.DBX=4/DBY=2, DISPLAY1 full active area, MAGH=MAGV=0:
|
||
sprite shows at displayed (0..3, 0..3).
|
||
- `tb_gs_scanout_display_window.sv` — sprite at VRAM (0..3, 0..3),
|
||
DBX=DBY=0, DISPLAY1 with DX=2/DY=1/DW=3/DH=3, MAGH=MAGV=0:
|
||
sprite shows at displayed (2..5, 1..4); pixels outside the
|
||
window are black even though pcrtc's raster passes through them.
|
||
- `tb_gs_scanout_magh_magv.sv` (Ch93) — sprite at VRAM (0..3, 0..3),
|
||
DBX=DBY=0, DISPLAY1 with DX=4/DY=2/DW=7/DH=7, MAGH=1/MAGV=1
|
||
(2×/2×): 4×4 VRAM sprite stretches to fill the 8×8 displayed
|
||
window pixel-perfect; pixels outside the window are black.
|
||
- `tb_gs_scanout_psm16.sv` (Ch94) — 4×4 RGB5A1 sprite written
|
||
directly to vram_stub at PSMCT16 byte stride, DISPFB1.PSM=0x02:
|
||
5→8 bit-replicate decode produces the right (R8, G8, B8) at
|
||
scanout. (No gs_stub instantiated; this TB exercises the PSM
|
||
decode path in isolation.)
|
||
- `tb_gs_scanout_psmt8.sv` (Ch96) — 4×4 PSMT8 sprite of indices
|
||
0x10..0x1F written directly to vram_stub at 1 byte/pixel
|
||
stride. DISPFB1.PSM=0x13, DISPLAY1 with DX=4/DY=2/DW=7/DH=3
|
||
AND MAGH=1 (2× horizontal). Asserts each scan-out displayed
|
||
pixel reads back as grayscale R=G=B=expected index, proving
|
||
byte stride + display window + horizontal magnification all
|
||
work at 1 byte/pixel.
|
||
- `tb_gs_scanout_psmt8_clut.sv` (Ch97) — same 4×4 PSMT8 sprite,
|
||
plus a programmed CLUT where `CLUT[i] = ABGR(0xFF, i+0x80, i+0x40, i)`.
|
||
With `clut_enable=1` and `clut_csa=0`, asserts each scan-out
|
||
pixel reads back as the CLUT entry for its index — PSMT8
|
||
storage + CLUT lookup compose correctly into real RGB. Three
|
||
phases: full-frame CSA=0, single-pixel CSA=1 (idx 0x00 →
|
||
CLUT[0x10]), and CSA=1 wrap (idx 0xF8 → CLUT[0x08]).
|
||
- `tb_gs_tex0_clut.sv` (Ch98) — drives gs_stub's GIF reg# 0x06
|
||
(TEX0_1) and asserts the latch + sub-field decoders match the
|
||
encoded payload (CBP/CPSM/CSM/CSA/CLD bit ranges). Phase 2
|
||
wires `pcrtc.clut_csa` from `gs_stub.tex0_1_csa_q` (instead
|
||
of TB-side sideband) and verifies the CSA value flows from a
|
||
GIF register write into the CLUT lookup math at scan-out.
|
||
- `tb_gs_clut_load.sv` (Ch99) — full TEX0.CLD-driven VRAM→CLUT
|
||
load round trip. Stages 256 PSMCT32 entries in VRAM at
|
||
`CBP*256` (using the new `vram_stub` second read port), drives
|
||
TEX0_1 with `CBP=4, CPSM=PSMCT32, CSM=CSM2, CLD=1`, waits for
|
||
`clut_loader_stub.load_busy` to fall, then runs PSMT8 scanout
|
||
and asserts each in-sprite pixel reads back as the CLUT entry
|
||
the loader copied — no TB-direct CLUT writes needed. Also
|
||
carries a Ch99-audit negative phase: a TEX0 write with CSM=0
|
||
(CSM1 swizzle, deferred) silently no-ops instead of laying
|
||
down wrong linear bytes.
|
||
- `tb_gs_clut_load_ct16.sv` (Ch100) — CPSM=PSMCT16 variant of the
|
||
Ch99 load TB. Stages 256 RGB5A1 entries (2 bytes each) in VRAM
|
||
at `CBP*256`, drives TEX0_1 with `CPSM=2`. The loader now
|
||
walks at 2-byte stride, unpacks RGB5A1 → PSMCT32 ABGR via 5→8
|
||
bit-replicate, and writes to clut_stub. PSMT8 scanout produces
|
||
the expanded RGB. Ch100-audit alpha coverage: per-entry `a1 = idx[0]`
|
||
varies the alpha bit so both `{8{0}} = 0x00` and `{8{1}} = 0xFF`
|
||
are exercised; a TB-side `clut_we` snoop captures every loader
|
||
write so alpha can be asserted directly without going through
|
||
the RGB-only scanout path.
|
||
- `tb_gs_clut_load_cld_modes.sv` (Ch101 + Ch102) — conditional
|
||
CLD-mode policy. Phases walk through CLD ∈ {0, 1, 2, 3, 4, 5,
|
||
6, 7} with varying CBP/CPSM/CSA, counting `loader_busy` rising
|
||
edges to prove: CLD=0 never loads; CLD=1 always (full); CLD=2
|
||
loads only when CBP changed; CLD=3 loads when CBP/CPSM/CSA
|
||
any-changed (CBP, CSA, and CPSM arms each isolated); CLD=4
|
||
always loads but only the 16-entry CSA window (Ch102 — write
|
||
range correctness is locked by `tb_gs_clut_load_csa_window`);
|
||
CLD ∈ {5, 6, 7} reserved no-ops.
|
||
- `tb_gs_clut_load_csa_window.sv` (Ch102) — CLD=4 write-range
|
||
correctness. Phase 1 stages 256 distinct PSMCT32 entries in
|
||
VRAM and runs CLD=1 to fill all 256 CLUT slots with pattern_a.
|
||
Phase 2 stages 16 different entries at a new CBP, drives CLD=4
|
||
with CSA=2 (window = idx 32..47), and asserts via a `clut_we`
|
||
snoop that exactly 16 writes occurred AND the captured array
|
||
contains: pattern_a(i) at i ∈ [0, 32) ∪ [48, 256), pattern_b(i-32)
|
||
at i ∈ [32, 48). Proves 240 entries are preserved across the
|
||
partial load. Audit-low extensions: Phase 3 covers the
|
||
high-CSA wrap (CSA=16 → window-base wraps mod-256 to 0); Phase
|
||
4 covers CT16 partial (CPSM=PSMCT16, 2-byte stride, RGB5A1
|
||
unpack at the loader, window at idx 160..175).
|
||
- `tb_gs_scanout_psmt4_clut.sv` (Ch103) — PSMT4 scanout. Stages
|
||
a 4×4 PSMT4 sprite (2 pixels/byte) and 16 CLUT entries.
|
||
Phase 1 (`clut_enable=1`): asserts each pixel reads
|
||
`CLUT[zero-ext(nibble) + CSA*16]`. Phase 2 (`clut_enable=0`):
|
||
asserts the grayscale fallback replicates the 4-bit nibble
|
||
across the 8-bit DAC value. Both phases verify byte-stride
|
||
half-extraction (low/high nibble pick) at every active pixel.
|
||
Audit-low Phase 3 locks PSMT4 + nonzero CSA (CSA=1, window
|
||
16..31) end-to-end: TB-direct CLUT writes plant a 0xDEAD_BEEF
|
||
sentinel at entries 0..15 and a per-index formula at 16..31,
|
||
scanout asserts each pixel reads the formula and never the
|
||
sentinel.
|
||
- `tb_gs_demo_psmt4_e2e.sv` (Ch107) — first end-to-end demo for
|
||
the GS/PCRTC stack. **Scope is GS-side only**: the post-GIF
|
||
register stream (per-reg A+D writes via `gs_stub.gif_reg_*`)
|
||
plus privileged-block MMIO drive the pipeline; `gif_packed_stub`
|
||
/ GIFtag-PACKED is BYPASSED — feeding the same demo through
|
||
the GIF front-end is a future chapter. Step 1 stages 16
|
||
PSMCT32 palette entries in VRAM at `CBP*256` (modelled as a
|
||
TB-direct write — DMA→GS image transfer is a future chapter,
|
||
but the framebuffer itself is NOT TB-direct). Step 2 drives
|
||
per-reg writes (PRIM/FRAME_1/RGBAQ/XYZ2) for four SPRITEs
|
||
laying out a 4-quadrant 8×4 image (TL idx 0x5, TR idx 0x7,
|
||
BL idx 0xA, BR idx 0xC) at FRAME_1.PSM=PSMT4 — all 32
|
||
framebuffer pixels arrive via the Ch106 raster channel.
|
||
Step 3 drives TEX0_1 with `CBP=palette, CPSM=PSMCT32,
|
||
CSM=CSM2, CSA=0, CLD=4`; loader writes clut_stub[0..15].
|
||
Step 4 brings up scanout via privileged-block writes to
|
||
DISPFB1 (PSM=PSMT4) + DISPLAY1 + PMODE.EN1. Step 5 captures
|
||
one full frame and asserts each pixel reads back as
|
||
`CLUT[quadrant_idx]` (or `CLUT[0]` outside the 8×4 image
|
||
since vram_stub zero-init means nibble=0). Aggregate asserts:
|
||
32 PSMT4 emits, mask ∈ {0x0F, 0xF0} on every emit
|
||
(channel-isolation locked architecturally — only R[3:0] ever
|
||
reaches VRAM at PSMT4), loader fires exactly once, no
|
||
raster_overflow / raster_degenerate. This TB is the first
|
||
stack-wide proof that the GS-side post-GIF sequence —
|
||
per-reg writes → indexed framebuffer → TEX0+CLD palette
|
||
upload → PMODE/DISPFB/DISPLAY scanout — produces a coherent
|
||
RGB frame end to end without TB sideband for the framebuffer
|
||
pixels. Routing the same primitives through GIFtag/PACKED A+D
|
||
via `gif_packed_stub` closes the last sideband and is the
|
||
natural Ch108 anchor.
|
||
- `tb_gs_demo_psmt4_e2e_ee_full_bootlet.sv` (Ch114) — extends
|
||
Ch113's EE-driven control plane to ALSO drive the DMAC
|
||
channel-2 setup from the same MIPS instruction stream. The EE
|
||
program now writes the 4 GS-priv registers + the 3 DMAC ch2
|
||
registers (MADR / QWC / CHCR.start) via real `sw`
|
||
instructions, then SYSCALLs to halt. Total: 7 EE-CPU MMIO
|
||
writes (4 GS-priv + 3 DMAC) producing the same 16×8 captured
|
||
frame. **Architectural note**: the program lives in
|
||
`bios_rom_stub` at 0xBFC0_0000 / phys 0x1FC0_0000, NOT in
|
||
RAM. A RAM-resident program would have its instruction
|
||
fetches contend with the DMAC's RAM reads through
|
||
`ee_ram_stub`'s single read port (the map's CPU>DMAC
|
||
arbitration silently corrupts DMAC data). Putting the program
|
||
in BIOS decouples the two paths so EE and DMAC run truly in
|
||
parallel. This also matches real PS2: the EE boots out of
|
||
BIOS ROM. PASS criteria add to Ch113's: **3 EE-driven DMAC
|
||
writes** seen at the map's DMAC-ch2 decode; the existing
|
||
`dma=(1,36,1)` event taxonomy still holds (those events are
|
||
triggered by the EE's CHCR write, not a TB-direct write).
|
||
The remaining TB-direct surfaces in the demo are now narrowly
|
||
the GIF payload pre-stage in RAM (a real EE driver would
|
||
itself stage this) and bios_rom_stub's program preload (which
|
||
is the EE bootlet itself — not a runtime TB sideband).
|
||
- `tb_gs_demo_psmt4_e2e_ee_program.sv` (Ch113) — same demo as
|
||
Ch112 but the 4 control-plane MMIO writes (PMODE / DISPFB1 /
|
||
DISPLAY1 lo / DISPLAY1 hi) are no longer issued by the TB
|
||
directly. Instead a 10-instruction MIPS program preloaded into
|
||
ee_ram_stub at phys 0x800 (kseg0 0x80000800) is fetched and
|
||
executed by `ee_core_stub` (parameterized with
|
||
`PC_RESET=0x80000800`). The program is `LUI/ORI/SW × 4` plus a
|
||
SYSCALL terminator; the SW instructions target `0x12000000+`
|
||
and flow through `ee_memory_map_stub`'s GS-priv decode →
|
||
`ee_gs_priv_bridge_stub` → `gs_stub.reg_wr_*`. Closes the
|
||
very last TB-direct surface in the demo flow: every byte AND
|
||
every register bit AND every control-plane decision now
|
||
arrives from a real-shape source. PASS criteria add to
|
||
Ch112's: `core_halt_o == 1` (asserts exactly once on the
|
||
SYSCALL halt), `core_trap == 0`, EE program halts at
|
||
`EE_PROG_VA + 36 = 0x80000824` (the SYSCALL slot). The TB
|
||
still pre-stages the GIF payload and triggers the DMAC
|
||
channel-2 transfer via TB-direct CHCR/MADR/QWC writes — a
|
||
wider EE program that also drives DMAC bring-up is a
|
||
separate future chapter.
|
||
- `tb_gs_demo_psmt4_e2e_eemap.sv` (Ch112) — same demo as Ch111
|
||
but the bridge is no longer driven by the TB directly. Instead
|
||
the TB drives `ee_memory_map_stub.ee_wr_*` with full 32-bit
|
||
physical addresses targeting the new GS-privileged-MMIO window
|
||
at 0x1200_0000-0x1200_FFFF (64 KiB; phys[28:16] == 13'h1200).
|
||
The map decodes the window, peels the 16-bit offset, and hands
|
||
the 32-bit half-write to `ee_gs_priv_bridge_stub`, which then
|
||
fires gs_stub.reg_wr_* with the running 64-bit shadow value.
|
||
Closes the last control-plane routing gap before a real EE
|
||
instruction stream can drive the demo's bring-up: PMODE /
|
||
DISPFB1 / DISPLAY1 are now reachable from `sw 0x1200_0080(...)`-
|
||
shaped writes rather than from a TB-shaped EE-MMIO port.
|
||
PASS criteria identical to Ch111: 4 EE-MMIO writes / 4 bridge
|
||
fires, same 16×8 captured frame. **Architectural note**: this
|
||
chapter ALSO adds 4 new output ports to `ee_memory_map_stub`
|
||
(`ee_gs_priv_wr_en/addr/data/be`). Existing 56 ee_memory_map_
|
||
stub-using TBs leave those outputs unconnected (named-port
|
||
instantiation tolerates omitted outputs); only the new Ch112
|
||
TB wires them through to the bridge.
|
||
- `tb_gs_demo_psmt4_e2e_eemmio.sv` (Ch111) — same demo as
|
||
Ch110 but the privileged-block control writes (PMODE / DISPFB1
|
||
/ DISPLAY1) now arrive through `ee_gs_priv_bridge_stub` (a new
|
||
RTL module) driven by EE-shaped 32-bit MMIO writes from the
|
||
TB, instead of TB-direct gs_stub.reg_wr_* pulses. The bridge
|
||
accumulates 32-bit half-writes per 8-byte slot and fires a
|
||
64-bit gs_stub.reg_wr_* pulse on each EE half-write —
|
||
single-half writes work for PMODE.EN1 and DISPFB1 (interesting
|
||
bits in the low 32), and a pair of writes (lo+hi to
|
||
consecutive 4-byte addresses) handles DISPLAY1 whose DW/DH
|
||
live in the high 32. **Bridge contract**: full-word writes
|
||
only — `ee_wr_be` must be `4'b1111`; sub-word (per-byte)
|
||
merging into the 64-bit shadow is intentionally out of scope
|
||
and a `$error` fires on any narrower be (control-plane GS
|
||
registers are always written as full 32-bit `sw` halves of an
|
||
`sd`). **Scope precision**: this chapter closes the TB-direct
|
||
`gs_stub.reg_wr_*` surface — i.e., the privileged-MMIO sink at
|
||
the GS itself. The bridge is instantiated by the TB directly;
|
||
it is NOT yet wired into `ee_memory_map_stub`, so the full
|
||
EE-CPU / memory-map MMIO path (a real EE instruction stream
|
||
reaching 0x12000000+ via `sw`) is a separate future chapter.
|
||
PASS criteria add to Ch110's: **4 EE-MMIO writes** (1 PMODE +
|
||
1 DISPFB1 + 2 DISPLAY1) and **4 bridge fires** producing the
|
||
same 16×8 captured frame as Ch110.
|
||
- `tb_gs_demo_psmct32_swizzle_trxdir_e2e.sv` (Ch124) — companion
|
||
to Ch123: same EE-bootlet → DMAC → GIF data plane and same all-
|
||
three-gates-on instantiation, but the framebuffer is filled by
|
||
a TRXDIR/IMAGE upload through `gif_image_xfer_stub` instead of
|
||
by raster. The Ch121 image-xfer write-side swizzle gate becomes
|
||
LOAD-BEARING inside the demo flow — every byte the GS produces
|
||
comes out of the image-xfer engine at canonical PSMCT32
|
||
swizzled addresses, and the raster path is dormant. Payload:
|
||
U1 (PACKED, NREG=4: BITBLTBUF{DBP=0, DBW=1, DPSM=PSMCT32} /
|
||
TRXPOS{DSAX=DSAY=0} / TRXREG{RRW=16, RRH=8} / TRXDIR{XDIR=0})
|
||
+ U2 (IMAGE, NLOOP=32: 32 IMAGE qwords carrying the 128 PSMCT32
|
||
pixels of the same four-quadrant pattern Ch123 used). DMAC QWC
|
||
= 38. Verification mirrors Ch123: (1) full 16×8 scanout frame
|
||
capture; (2) per-pixel byte readback at the canonical swizzled
|
||
address via vram_stub's 2nd read port; (3) strict linear-vs-
|
||
swizzled separator at byte 1024 stays 0. Aggregate counts:
|
||
`dma=(1,38,1) ee_dmac_wr=3 giftags=2 ad_writes=4
|
||
xfer_writes=128 ee_priv_wr=4 bridge_fires=4 core_halt=1
|
||
emits=0 frame=16x8`. Ch123 + Ch124 together exercise BOTH
|
||
PSMCT32 write-side paths (raster Ch122 + image-xfer Ch121)
|
||
end-to-end through the same driver-shaped flow with the
|
||
same swizzled-scanout (Ch120) read side.
|
||
- `tb_gs_demo_psmct32_swizzle_e2e.sv` (Ch123) — full driver-shaped
|
||
end-to-end demo with ALL THREE PSMCT32 swizzle gates flipped
|
||
on simultaneously: `gs_stub#(PSMCT32_SWIZZLE=1)` (Ch122 raster),
|
||
`gif_image_xfer_stub#(PSMCT32_SWIZZLE=1)` (Ch121 — instantiated
|
||
but unused in this demo), `gs_pcrtc_stub#(PSMCT32_SWIZZLE=1)`
|
||
(Ch120 read). The data plane is the same DMAC + GIF + EE-bootlet
|
||
shape Ch107..Ch114 demos use: a BIOS-resident EE program
|
||
(PC_RESET=0xBFC0_0000) configures GS-priv (DISPFB1, DISPLAY1
|
||
lo/hi, PMODE.EN1) via `sw` instructions through
|
||
`ee_memory_map_stub` → `ee_gs_priv_bridge_stub` →
|
||
`gs_stub.reg_wr_*`, then kicks DMAC ch2 (MADR / QWC / CHCR)
|
||
via `sw` to the DMAC reg window, then `SYSCALL` halts. DMAC
|
||
delivers a 24-qword payload from `ee_ram_stub` to
|
||
`gif_packed_stub`, which dispatches 4 SPRITE PACKED packets
|
||
(1 GIFtag + 5 A+D each — PRIM, FRAME_1=PSMCT32, RGBAQ, XYZ2,
|
||
XYZ2). The 4 sprites tile the 16×8 active area into 4 quadrants
|
||
with unique RGB triples. With the raster gate on, all 128
|
||
per-pixel store addresses go through `gs_swizzle_psmct32_stub`;
|
||
with the pcrtc gate on, scanout reads from those same swizzled
|
||
addresses. **Two-phase verification**: (1) **scanout** — every
|
||
(x, y) in 16×8 captures its sprite's RGB; (2) **byte readback
|
||
via vram_stub's 2nd read port** — for every (x, y), the 32-bit
|
||
word at `ref_addr_psmct32(0, 1, x, y)` equals the sprite's
|
||
`{A=0xFF, B, G, R}` PSMCT32 word. Strict linear-vs-swizzled
|
||
separator at byte 1024 (where the linear formula's y=4 row
|
||
would land at stride=256) stays 0 — the swizzled write set
|
||
for the 16×8 image stays in blocks (0,0) and (1,0) of page 0
|
||
(bytes 0..511), so a fall-through to linear would have placed
|
||
sprite-2's color at byte 1024. Aggregate counts:
|
||
`dma=(1,24,1) ee_dmac_wr=3 giftags=4 ad_writes=20 xfer_writes=0
|
||
ee_priv_wr=4 bridge_fires=4 core_halt=1 emits=128 frame=16x8`.
|
||
This is the FIRST end-to-end demo where every PSMCT32 byte
|
||
the GS produces lives at the canonical PCSX2 swizzled address
|
||
AND the scanout reads from it — byte-accurate to real PS2
|
||
VRAM layout, end-to-end through the driver-shaped flow.
|
||
- `tb_gs_raster_swizzle_psmct32.sv` (Ch122) — focused contract
|
||
for the new `PSMCT32_SWIZZLE` parameter on `gs_stub`. When the
|
||
parameter is set to 1 AND the active raster PSM is PSMCT32
|
||
(`ras_psm == 6'h00`), the per-pixel raster emit address is
|
||
routed through the Ch119 `gs_swizzle_psmct32_stub` (FBP=ras_fbp,
|
||
FBW=ras_fbw, x=s2_x_q[11:0], y=s2_y_q[11:0]) and its output is
|
||
the absolute byte address (FBP*2048 already baked in).
|
||
At Ch122 only, PSMCT16/PSMT8/PSMT4 raster emits always took
the linear path. Ch128 later closed the PSMCT16 raster gate,
Ch134 the PSMT8 raster gate, and Ch140 the PSMT4 raster gate
(each with its own per-PSM parameter on this same `gs_stub`).
Default 0 keeps every existing PSMCT32 raster TB unchanged.
|
||
**Three-phase verification**: (1) **origin SPRITE** — drive a
|
||
single 16×4 SPRITE at FRAME_1{FBP=0, FBW=1, PSMCT32} with RGBAQ
|
||
R=0x55/G=0xAA/B=0xCC/A=0x77, expect 64 emits, per-pixel byte
|
||
readback via vram_stub's 2nd read port at swizzled addresses
|
||
confirms each pixel lands where the swizzle says. Strict
|
||
linear-vs-swizzled separators at bytes 512 and 768 (the linear
|
||
formula's y=2 / y=3 row starts) stay 0 — proves the gate is
|
||
live. (2) **scanout agreement** — enable the Ch120 swizzled-
|
||
pcrtc path on the same VRAM contents, capture the full 16×4
|
||
frame, assert each visible pixel reads back the SPRITE's RGB.
|
||
Both gs_stub (Ch122 raster) and gs_pcrtc_stub (Ch120 scanout)
|
||
instantiate the same swizzle module; a successful capture
|
||
proves the two integrations agree at byte level — what raster
|
||
wrote at swizzled addresses comes out on r/g/b at the same
|
||
(x, y). (3) **non-origin SPRITE** — re-arm the raster with
|
||
FRAME_1{FBP=4, FBW=2, PSMCT32} and an 8×2 SPRITE at
|
||
(60, 4)..(67, 5) crossing the page-x boundary at x=64 (so
|
||
page_index actually changes mid-row). Pins three contracts
|
||
the origin transfer can't distinguish from a buggy
|
||
implementation: (a) `ras_fbp` reaches the swizzle's `fbp` input
|
||
(FBP=0 in Phase 1 would have masked a tied-zero regression),
|
||
(b) `ras_fbw` reaches the swizzle's `fbw` input (FBW=1 would
|
||
have masked a tied-one regression), (c) the swizzle gets the
|
||
FULL absolute pixel coords (s2_x_q, s2_y_q) rather than
|
||
bbox-local coords (Phase 1's sprite started at (0,0) so
|
||
absolute and local were equal there). Strict linear-vs-
|
||
swizzled separator at byte 10480 (where the linear formula
|
||
would land Phase-3's first pixel) stays 0. Total emit count
|
||
after all phases: 64 + 16 = 80. With Ch120 (read), Ch121
|
||
(TRXDIR upload), and Ch122 (raster emit) all live, the three
|
||
major PSMCT32 paths are byte-consistent end-to-end.
|
||
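
The strict separators in this entry are simply the legacy linear addresses that a fall-through would have written. A minimal reference sketch follows, assuming the `FBP*2048` byte base quoted above and a stride of `FBW*64` pixels. `linear_addr` is a hypothetical helper name, not a function from the TB or the RTL.

```python
def linear_addr(fbp, fbw, x, y, bits_per_pixel):
    """Legacy linear byte address: base + y * stride + x * bytes-per-pixel.

    fbp * 2048 is the byte base used in this document; fbw counts
    64-pixel units. Sub-byte formats (PSMT4) round x down to its byte.
    """
    stride = fbw * 64 * bits_per_pixel // 8
    return fbp * 2048 + y * stride + (x * bits_per_pixel) // 8


# Phase 1 separators: row starts for y=2 and y=3 at FBP=0, FBW=1.
phase1 = [linear_addr(0, 1, 0, y, 32) for y in (2, 3)]
# Phase 3 separator: first pixel (60, 4) at FBP=4, FBW=2.
phase3 = linear_addr(4, 2, 60, 4, 32)
print(phase1, phase3)  # [512, 768] 10480
```

The same formula with 16 or 4 bits per pixel reproduces the PSMCT16 and PSMT4 raster separators quoted in the Ch128 and Ch140 entries (9336 and 8766).
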
- `tb_gs_image_xfer_swizzle_psmct32.sv` (Ch121) — focused contract
|
||
for the new `PSMCT32_SWIZZLE` parameter on `gif_image_xfer_stub`.
|
||
When the parameter is set to 1 AND the upload's PSM is PSMCT32,
|
||
per-pixel VRAM byte addresses are routed through the Ch119
|
||
`gs_swizzle_psmct32_stub` (FBP=0, FBW=DBW, x=DSAX+cur_x, y=DSAY+
|
||
cur_y) and `dest_base_q (= DBP*256)` is added back to anchor at
|
||
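the upload's DBP base. At Ch121 only, PSMCT16/PSMT8/PSMT4 uploads
always took the linear path; Ch127, Ch133, and Ch139 later closed
those image-xfer gates with their own per-PSM parameters.
Default 0 keeps every existing image-xfer TB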
|
||
unchanged. **Three-phase verification**: (1) **origin transfer**
|
||
— TRXDIR upload of a 16×4 PSMCT32 image at DBP=DSAX=DSAY=0,
|
||
DBW=1, RRW=16, RRH=4 → 64 pixels, 16 IMAGE qwords. After the
|
||
upload completes, the TB reads VRAM via vram_stub's 2nd read
|
||
port at the SWIZZLED byte address (TB-side `ref_addr()` mirrors
|
||
the swizzle module) and asserts each pixel landed where the
|
||
swizzle says. Strict linear-vs-swizzled separator: bytes 512
|
||
and 768 (where linear y=2 and y=3 rows would land) stay 0 under
|
||
swizzled, since the 16×4 image only fills blocks (0,0) and (1,0)
|
||
which together cover bytes [0..127] ∪ [256..383]. (2) **scanout
|
||
agreement** — enable the Ch120 swizzled-pcrtc path on the same
|
||
VRAM contents, capture the full 16×4 frame, assert each
|
||
scanned-out pixel matches its uploaded color. Both upload and
|
||
scanout instantiate the same `gs_swizzle_psmct32_stub`, so a
|
||
successful capture proves the two integrations agree at byte
|
||
level — what was written by TRXDIR comes out on r/g/b at the
|
||
same (x, y). (3) **non-origin transfer** — re-arm with NONZERO
|
||
DBP, DSAX, and DSAY (DBP=8, DSAX=4, DSAY=2, RRW=8, RRH=4) and
|
||
verify each uploaded pixel lands at `DBP*256 + swizzle(0, DBW,
|
||
DSAX+x_local, DSAY+y_local)`. Phase 3 pins TWO contracts the
|
||
origin transfer can't distinguish from a buggy implementation:
|
||
(a) `dest_base_q (= DBP*256)` is correctly ADDED ON TOP of the
|
||
swizzle output (with DBP=0 a missing-add regression would still
|
||
pass), and (b) the swizzle is fed the FULL effective coordinates
|
||
(with DSAX=DSAY=0 a "feeds only cur_x/cur_y" regression would
|
||
still pass). Strict linear-vs-swizzled separator at byte 3088
|
||
(where the linear formula's y=2 row of the P3 image would
|
||
land) stays 0 under swizzled. NOTE (now historical): at Ch121,
gs_stub raster writes still used linear addressing; Ch122 closed
that gate.
|
||
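
Two numbers in this entry can be rechecked by hand: the IMAGE qword count for the origin transfer and the Phase 3 linear separator. The sketch below assumes the linear upload formula is `DBP*256 + y*stride + x*bytes_per_pixel` with a stride of `DBW*64` pixels. Both helper names are hypothetical.

```python
def xfer_linear_addr(dbp, dbw, dsax, dsay, x_local, y_local, bits_per_pixel):
    """Linear destination byte for an upload pixel: DBP*256 anchors the
    buffer, DBW counts 64-pixel units, DSAX/DSAY offset the rectangle."""
    stride = dbw * 64 * bits_per_pixel // 8
    x, y = dsax + x_local, dsay + y_local
    return dbp * 256 + y * stride + (x * bits_per_pixel) // 8


def image_qwords(rrw, rrh, bits_per_pixel):
    """IMAGE qwords needed for an RRW x RRH rectangle (128 bits each)."""
    total_bits = rrw * rrh * bits_per_pixel
    return -(-total_bits // 128)  # ceiling division


print(image_qwords(16, 4, 32))                 # 16
print(xfer_linear_addr(8, 1, 4, 2, 0, 2, 32))  # 3088
```
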
- `tb_gs_scanout_swizzle_psmct32.sv` (Ch120) — focused contract
|
||
for the new `PSMCT32_SWIZZLE` parameter on `gs_pcrtc_stub`. When
|
||
the parameter is set to 1 AND the active PSM is PSMCT32, PCRTC
|
||
reads VRAM at swizzled addresses (via the Ch119 swizzle module
|
||
instantiated inside pcrtc) instead of the legacy linear formula.
|
||
Other PSMs (CT16/T8/T4) and `PSMCT32_SWIZZLE=0` keep the original
|
||
linear path unchanged. Topology: TB drives `vram_stub.write_*`
|
||
directly with each pixel's color preloaded at the swizzled byte
|
||
address (TB-side `ref_addr()` mirrors the DUT swizzle math), then
|
||
pcrtc with `PSMCT32_SWIZZLE=1` scans out the frame and the TB
|
||
asserts each captured pixel matches the preloaded color. Image
|
||
is 16×4 PSMCT32 (covers blocks (0,0) AND (1,0) horizontally) at
|
||
FBP=0/FBW=1; pcrtc active area is 8×4 (block (0,0) entirely),
|
||
but the swizzle vs. linear distinction shows up at any y>0
|
||
(linear y=1 → byte 64; swizzled byte 32) so even the in-window
|
||
region is a strict separator. Per-pixel color is unique
|
||
(`{A=0xFF, B=y<<4, G=x<<4, R=0x10|(y*8+x)}`) so any wrong-
|
||
address commit surfaces immediately. NOTE: at Ch120 ONLY,
|
||
gs_stub raster writes and gif_image_xfer_stub uploads still
|
||
used linear addressing — Ch120 was read-side only. Ch121
|
||
(image-xfer) and Ch122 (raster) closed the write-side gates,
|
||
and Ch123 demonstrates all three running together end-to-end.
|
||
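
The per-pixel color formula is what makes a wrong-address commit visible, so it is worth confirming that no two pixels of the 16×4 image share a word. The sketch below packs the `{A, B, G, R}` fields as a 32-bit word with R in the low byte; that packing order is an assumption based on the PSMCT32 word layout named in this entry.

```python
def pixel_word(x, y):
    """Pack the TB's per-pixel color as a PSMCT32 {A, B, G, R} word."""
    a = 0xFF
    b = (y << 4) & 0xFF
    g = (x << 4) & 0xFF
    r = (0x10 | (y * 8 + x)) & 0xFF
    return (a << 24) | (b << 16) | (g << 8) | r


words = {(x, y): pixel_word(x, y) for y in range(4) for x in range(16)}
assert len(set(words.values())) == len(words)  # all 64 colors distinct
print(hex(words[(0, 0)]), hex(words[(15, 3)]))
```

G alone encodes x and B alone encodes y, so the words stay distinct even where the R term `y*8+x` repeats across rows.
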
- `tb_gs_demo_psmt8_swizzle_trxdir_e2e.sv` (Ch136) — companion to
|
||
Ch135: same EE-bootlet → DMAC → GIF data plane and same all-
|
||
three-gates-on instantiation, but the framebuffer is filled by
|
||
a TRXDIR/IMAGE upload through `gif_image_xfer_stub` instead of
|
||
by raster. The Ch133 PSMT8 image-xfer write-side swizzle gate
|
||
becomes LOAD-BEARING inside the demo flow — every byte the GS
|
||
produces comes out of the image-xfer engine at canonical PSMT8
|
||
swizzled addresses, and the raster path is dormant. Mirrors
|
||
Ch124 PSMCT32 + Ch130 PSMCT16 TRXDIR demos for the third PSM.
|
||
Payload: U1 (PACKED, NREG=4: BITBLTBUF{DBP=0, DBW=2,
|
||
DPSM=PSMT8} / TRXPOS{DSAX=DSAY=0} / TRXREG{RRW=16, RRH=8} /
|
||
TRXDIR{XDIR=0}) + U2 (IMAGE, NLOOP=8: 8 IMAGE qwords each
|
||
carrying 16 PSMT8 bytes for the 16×8 image, row-major). DBW=2
|
||
is the minimum even DBW for PSMT8. DMAC QWC=14. Per-quadrant
|
||
byte indices Q0=0xA0/Q1=0x40/Q2=0xC0/Q3=0x60 reused from Ch135
|
||
so the verify side is unchanged. A new `trxdir_arms_seen` counter
is asserted to equal 1 (single TRX setup), and an xfer-side
per-emit observer asserts that every `xfer_we` pulse fires with
`be=4'b0001`, `mask=0xFFFFFFFF` (PSMT8 single-byte commit shape). Verification
|
||
mirrors Ch135: (1) full 16×8 scanout frame capture; (2) per-
|
||
pixel BYTE readback at the canonical swizzled byte address
|
||
(with `addr[1:0]` selecting the right byte from the 32-bit
|
||
word) via vram_stub's 2nd port; (3) strict separators at bytes
|
||
128 and 256 stay 0. Aggregate counts: `dma=(1,14,1)
|
||
ee_dmac_wr=3 giftags=2 ad_writes=4 trxdir_arms=1
|
||
xfer_writes=128 ee_priv_wr=4 bridge_fires=4 core_halt=1
|
||
emits=0 frame=16x8`. **First-attempt PASS** errors=0. Ch135 +
|
||
Ch136 together close the PSMT8 byte-accuracy milestone end-
|
||
to-end through the full driver-shaped flow — same Ch123+Ch124
|
||
(PSMCT32) and Ch129+Ch130 (PSMCT16) shape.
|
||
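
The DMAC QWC for each TRXDIR demo follows from the payload shape described above: a setup packet U1 (GIFtag plus four PACKED registers) and an IMAGE packet U2 (GIFtag plus NLOOP data qwords). The sketch below is a cross-check under that reading; `trxdir_demo_qwc` is an illustrative name.

```python
def trxdir_demo_qwc(rrw, rrh, bits_per_pixel, setup_regs=4):
    """Total DMAC qwords for a two-packet TRXDIR upload.

    U1 = 1 GIFtag + setup_regs PACKED register qwords;
    U2 = 1 GIFtag + NLOOP IMAGE qwords (128 bits each).
    Returns (nloop, total_qwc).
    """
    nloop = (rrw * rrh * bits_per_pixel + 127) // 128
    u1 = 1 + setup_regs
    u2 = 1 + nloop
    return nloop, u1 + u2


for name, bpp in (("PSMCT16", 16), ("PSMT8", 8), ("PSMT4", 4)):
    print(name, trxdir_demo_qwc(16, 8, bpp))
```

For the 16×8 image this gives NLOOP and QWC of (16, 22) for PSMCT16, (8, 14) for PSMT8, and (4, 10) for PSMT4, matching the `dma=(1,QWC,1)` counts in the Ch130, Ch136, and Ch142 entries.
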
- `tb_gs_demo_psmt4_swizzle_trxdir_e2e.sv` (Ch142) — companion to
|
||
Ch141 (raster-driven PSMT4 e2e): same EE-bootlet → DMAC → GIF
|
||
data plane and same all-three-gates-on instantiation, but the
|
||
framebuffer is filled by a TRXDIR/IMAGE upload through
|
||
`gif_image_xfer_stub` instead of by raster. The Ch139 PSMT4
|
||
image-xfer write-side swizzle gate becomes LOAD-BEARING inside
|
||
the demo flow — every nibble the GS produces comes out of the
|
||
image-xfer engine at canonical PSMT4 swizzled (addr,
|
||
nibble_hi) slots, and the raster path is dormant. Mirrors
|
||
Ch124's PSMCT32 TRXDIR demo, Ch130's PSMCT16 TRXDIR demo, and
|
||
Ch136's PSMT8 TRXDIR demo for the fourth (and last) common
|
||
GS PSM. Cloned from Ch136 and surgically retargeted to
|
||
PSMT4. Payload: U1 (PACKED, NREG=4: BITBLTBUF{DBP=0, DBW=2,
|
||
DPSM=PSMT4} / TRXPOS{DSAX=DSAY=0} / TRXREG{RRW=16, RRH=8} /
|
||
TRXDIR{XDIR=0}) + U2 (IMAGE, NLOOP=4 EOP=1: 4 IMAGE qwords
|
||
carrying 32 PSMT4 nibbles each — at RRW=16 each qword holds
|
||
2 rows: lanes 0..15 = row 2*qi, lanes 16..31 = row 2*qi+1,
|
||
matching Ch139's focused-TB packing). Total QWC = 10 (5+5).
|
||
EE-bootlet DISPFB1 immediate identical to Ch141 (LUI 0x000A;
|
||
ORI 0x0400 → PSM=PSMT4). Per-quadrant nibbles match Ch141
|
||
verbatim (Q0=0xA → 0xAA, Q1=0x4 → 0x44, Q2=0xC → 0xCC,
|
||
Q3=0x6 → 0x66) so the verify side reuses Ch141's pattern
|
||
unchanged. Verification mirrors Ch141: (1) full 16×8 scanout
|
||
frame capture via Ch138 swizzled-pcrtc; (2) per-pixel NIBBLE
|
||
readback at the canonical swizzled (addr, nibble_hi) slot
|
||
via vram_stub's 2nd port (addr[1:0]-keyed byte selection
|
||
then nibble_hi-keyed nibble selection); (3) strict linear-
|
||
vs-swizzled separator at byte 128 stays 0 (per-byte check,
|
||
not full word: a neighbor byte may legitimately be touched);
|
||
(4) per-emit observer asserts every image-xfer write is
|
||
`be=4'b0001` / `mask ∈ {0x0F, 0xF0}` (PSMT4 nibble RMW
|
||
shape) and the `trxdir_wr_q` arming pulse fires exactly
|
||
once. Aggregate counts: `dma=(1,10,1) ee_dmac_wr=3
|
||
giftags=2 ad_writes=4 trxdir_arms=1 xfer_writes=128
|
||
ee_priv_wr=4 bridge_fires=4 core_halt=1 emits=0 frame=16x8`.
|
||
Ch141 + Ch142 together exercise BOTH PSMT4 write-side paths
|
||
(raster Ch140 + image-xfer Ch139) end-to-end through the
|
||
same driver-shaped flow with the same swizzled-scanout
|
||
(Ch138) read side — bringing PSMT4 to full parity with the
|
||
PSMCT32, PSMCT16, and PSMT8 e2e coverage from Ch123+Ch124,
|
||
Ch129+Ch130, and Ch135+Ch136. **Architectural milestone**:
|
||
this is the first state of the project where ALL FOUR
|
||
common GS PSMs (CT32 + CT16 + T8 + T4) have BOTH a raster-
|
||
driven AND a TRXDIR-driven driver-shaped end-to-end byte-
|
||
accuracy demo — closing the **four-PSM × three-path × dual-
|
||
driver-shape e2e foundation** (8 demos total). The bug-fix
|
||
iteration: TB-side `ref_col_idx4` was first written with a
|
||
7-bit case key `{yb[2:0], xb[3:0]}` covering yb=0..7 in
|
||
indices 0..127, but the values for yb=4..7 were miscopied
|
||
from Ch139's yb=12..15 row (Ch139 only exercises yb=0..3
|
||
and yb=12..15). Phase 2 readback failed for all 64 pixels
|
||
in y=4..7 with `got=0 expected=0xC/0x6` — the engine wrote
|
||
the right nibbles to the right addresses (scanout passed),
|
||
but the TB's ref looked at the wrong slot. Fix: switched to
|
||
Ch141's 9-bit case key `{yb[3:0], xb[4:0]}` and used
|
||
Ch141's verified yb=0..7 values verbatim. The TB passed on the
first run after the table fix.
|
||
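
The bootlet's DISPFB1 immediate can be decoded to confirm that it selects PSMT4 with FBW=2. The sketch below assumes the DISPFB low-word field layout FBP in bits [8:0], FBW in bits [14:9], and PSM in bits [19:15]; treat that layout as an assumption about this project's register decode rather than a quote from the RTL. Both function names are hypothetical.

```python
def lui_ori(upper, lower):
    """Value built by MIPS `LUI rt, upper` followed by `ORI rt, rt, lower`."""
    return ((upper & 0xFFFF) << 16) | (lower & 0xFFFF)


def decode_dispfb_low(word):
    """Split the low 32 bits of DISPFB into FBP, FBW, and PSM fields."""
    return {
        "fbp": word & 0x1FF,          # bits [8:0]
        "fbw": (word >> 9) & 0x3F,    # bits [14:9]
        "psm": (word >> 15) & 0x1F,   # bits [19:15]
    }


fields = decode_dispfb_low(lui_ori(0x000A, 0x0400))
print(fields)  # {'fbp': 0, 'fbw': 2, 'psm': 20}
```

PSM 20 is 0x14, the PSMT4 code that the per-emit observers lock elsewhere in this document.
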
- `tb_gs_demo_psmt4_swizzle_e2e.sv` (Ch141) — first driver-shaped
|
||
end-to-end PSMT4 demo with all three PSMT4 swizzle gates
|
||
(Ch138 read-side pcrtc, Ch139 image-xfer write-side, Ch140
|
||
raster write-side) parameter-set to 1 simultaneously, but with
|
||
the demo flow exercising only the raster (Ch140) + scanout
|
||
(Ch138) paths as load-bearing. The Ch139 image-xfer gate is
|
||
smoke-only here (parameter is set but `xfer_writes_seen == 0`
|
||
is asserted, since no TRXDIR/IMAGE packet is delivered in the
|
||
raster-driven payload); the Ch139 load-bearing variant is
|
||
the Ch142 TRXDIR-driven PSMT4 e2e (mirrors Ch124/Ch130/Ch136).
|
||
PSMT4 counterpart of Ch123's PSMCT32 /
|
||
Ch129's PSMCT16 / Ch135's PSMT8 e2e demos. Same EE-bootlet →
|
||
DMAC → GIF data plane: BIOS-resident EE program configures
|
||
GS-priv (DISPFB1 PSMT4 with FBW=2, DISPLAY1, PMODE) via `sw`
|
||
instructions → kicks DMAC ch2 → SYSCALL halts. DMAC delivers
|
||
a 24-qword payload (4 SPRITE PACKED packets) through
|
||
`gif_packed_stub` into `gs_stub` raster. The 4 sprites tile
|
||
the 16×8 active area into 4 quadrants with per-quadrant unique
|
||
RGBAQ.R[3:0] nibbles (Q0=0xA → 0xAA, Q1=0x4 → 0x44,
|
||
Q2=0xC → 0xCC, Q3=0x6 → 0x66). PSMT4 raster (Ch106) takes
|
||
RGBAQ.R[3:0] as the nibble that hits VRAM via the existing
|
||
Ch106 nibble RMW machinery (write_be=4'b0001 + write_mask
|
||
0x0F or 0xF0); Ch140 keys the high/low nibble selector off the
|
||
swizzle's `nibble_hi` output instead of `s2_pixel_index[0]`.
|
||
PCRTC's Ch103 PSMT4 grayscale fallback (clut_enable=0)
|
||
surfaces the nibble as r=g=b={n, n} at scanout, so each
|
||
captured pixel IS the nibble we wrote (no CLUT setup needed
|
||
for this demo; a CLUT-driven Ch141 variant is a future
|
||
chapter). With the raster gate on, all 128 per-pixel nibble
|
||
stores go through `gs_swizzle_psmt4_stub`; with the pcrtc
|
||
gate on, scanout reads from those same swizzled (addr,
|
||
nibble_hi) slots. **Two-phase verification**: (1) full-frame
|
||
scanout asserts each (x, y) reads back its quadrant's nibble
|
||
as PSMT4 grayscale r=g=b={n, n}; (2) per-pixel NIBBLE readback
|
||
at the canonical swizzled address (with `addr[1:0]` selecting
|
||
the right byte from the 32-bit word, then `nibble_hi`
|
||
selecting which nibble of that byte) via vram_stub's 2nd
|
||
port — the 16×8 PSMT4 image lives entirely in the upper-left
|
||
of block (0,0) of page 0 (PSMT4 block = 32×16 px) and the
|
||
within-block columnTable4 yb=0..7 / xb=0..15 exercises
|
||
nibble_idx values [0..127]. Strict linear-vs-swizzled
|
||
separator at byte 128 (linear y=2 row start at PSMT4
|
||
stride=64 with FBW=2) stays 0 — outside block (0,0)'s
|
||
touched range. Per-emit observer locks PSM=0x14, be=4'b0001,
|
||
mask ∈ {0x0F, 0xF0}. Aggregate counts: `dma=(1,24,1)
|
||
ee_dmac_wr=3 giftags=4 ad_writes=20 xfer_writes=0
|
||
ee_priv_wr=4 bridge_fires=4 core_halt=1 emits=128 frame=16x8`.
|
||
**First-attempt PASS** errors=0. Together with Ch123 (PSMCT32
|
||
e2e), Ch129 (PSMCT16 e2e), and Ch135 (PSMT8 e2e), this is the
|
||
first state of the project where the full driver-shaped flow
|
||
has end-to-end byte-accuracy demos for ALL FOUR common GS
|
||
PSMs (CT32 + CT16 + T8 + T4) under software-shaped raster
|
||
traffic. The TRXDIR-driven PSMT4 companion landed at Ch142
|
||
(mirror of Ch124/Ch130/Ch136 making Ch139 load-bearing), so
|
||
Ch141 + Ch142 together close the PSMT4 byte-accuracy
|
||
milestone end-to-end through both driver shapes — bringing
|
||
PSMT4 to full parity with CT32/CT16/T8.
|
||
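
The nibble readback and the grayscale check both reduce to a byte select followed by a nibble select. The sketch below models that at word level. It folds the byte lane into a 32-bit word instead of reproducing the DUT's `write_be` / `write_mask` ports, so treat it as an abstraction of the behavior described above, with hypothetical function names.

```python
def write_nibble(word, addr, nibble_hi, nibble):
    """Read-modify-write one PSMT4 nibble into a 32-bit VRAM word.

    addr[1:0] picks the byte lane; nibble_hi picks the half of that byte
    (mask 0xF0 when set, 0x0F otherwise). Other bits are preserved.
    """
    shift = (addr & 3) * 8
    mask = (0xF0 if nibble_hi else 0x0F) << shift
    data = ((nibble & 0xF) << (4 if nibble_hi else 0)) << shift
    return (word & ~mask & 0xFFFFFFFF) | data


def read_nibble(word, addr, nibble_hi):
    """Readback: byte select by addr[1:0], then nibble select."""
    byte = (word >> ((addr & 3) * 8)) & 0xFF
    return (byte >> 4) if nibble_hi else (byte & 0xF)


def gray(nibble):
    """Grayscale fallback: replicate the nibble into an 8-bit channel."""
    return (nibble << 4) | nibble


w = write_nibble(0, addr=5, nibble_hi=True, nibble=0xA)
w = write_nibble(w, addr=5, nibble_hi=False, nibble=0x4)
print(hex(w), hex(gray(read_nibble(w, 5, True))))  # 0xa400 0xaa
```
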
- `tb_gs_demo_psmt8_swizzle_e2e.sv` (Ch135) — first driver-shaped
|
||
end-to-end PSMT8 demo with all three PSMT8 swizzle gates
|
||
(Ch132 read-side pcrtc, Ch133 image-xfer write-side, Ch134
|
||
raster write-side) parameter-set to 1 simultaneously, but with
|
||
the demo flow exercising only the raster (Ch134) + scanout
|
||
(Ch132) paths as load-bearing. The Ch133 image-xfer gate is
|
||
smoke-only here (parameter is set but `xfer_writes_seen == 0`
|
||
is asserted, since no TRXDIR/IMAGE packet is delivered in the
|
||
raster-driven payload); the Ch133 load-bearing variant is the
|
||
Ch136 TRXDIR-driven PSMT8 e2e (mirror of Ch124/Ch130). PSMT8
|
||
counterpart of Ch123's PSMCT32 / Ch129's PSMCT16 e2e demos.
Same EE-bootlet → DMAC → GIF data plane:
|
||
BIOS-resident EE program configures GS-priv (DISPFB1 PSMT8
|
||
with FBW=2, DISPLAY1, PMODE) via `sw` instructions → kicks
|
||
DMAC ch2 → SYSCALL halts. DMAC delivers a 24-qword payload
|
||
(4 SPRITE PACKED packets) through `gif_packed_stub` into
|
||
`gs_stub` raster. The 4 sprites tile the 16×8 active area into
|
||
4 quadrants with per-quadrant unique RGBAQ.R values
|
||
(Q0=0xA0, Q1=0x40, Q2=0xC0, Q3=0x60). PSMT8 raster (Ch105)
|
||
takes the natural ABGR's R channel as the byte index that
|
||
hits VRAM; PCRTC's Ch96 grayscale fallback (clut_enable=0)
|
||
surfaces the byte as R=G=B at scanout, so each captured pixel
|
||
IS the byte we wrote (no CLUT setup needed for this demo;
|
||
a CLUT-driven Ch135 variant is a future chapter). With the
|
||
raster gate on, all 128 per-pixel byte stores go through
|
||
`gs_swizzle_psmt8_stub`; with the pcrtc gate on, scanout
|
||
reads from those same swizzled addresses. **Two-phase
|
||
verification**: (1) full-frame scanout asserts each (x, y)
|
||
reads back its quadrant's byte as PSMT8 grayscale R=G=B; (2)
|
||
per-pixel BYTE readback at the canonical swizzled address
|
||
(with `addr[1:0]` selecting the right byte from the 32-bit
|
||
word) via vram_stub's 2nd port — the 16×8 PSMT8 image lives
|
||
entirely in the upper half of block (0,0) of page 0 (PSMT8
|
||
block = 16×16 px) and the within-block columnTable8 yb=0..7
|
||
exercises byte values [0..127]. Strict linear-vs-swizzled
|
||
separators at bytes 128 (linear y=1 row start at PSMT8
|
||
stride=128 with FBW=2) and 256 (linear y=2) stay 0 — both
|
||
outside block (0,0)'s touched range. Aggregate counts:
|
||
`dma=(1,24,1) ee_dmac_wr=3 giftags=4 ad_writes=20
|
||
xfer_writes=0 ee_priv_wr=4 bridge_fires=4 core_halt=1
|
||
emits=128 frame=16x8`. Together with Ch123 (PSMCT32 e2e) and
|
||
Ch129 (PSMCT16 e2e), this was the first state of the project
|
||
where the full driver-shaped flow had end-to-end byte-accuracy
|
||
demos for the CT32/CT16/T8 trio under software-shaped traffic.
|
||
PSMT4 was the natural follow-on and landed at Ch141 (raster-
|
||
driven, mirror of this demo) + Ch142 (TRXDIR-driven, mirror
|
||
of Ch136), closing the four-PSM × dual-driver-shape e2e
|
||
matrix.
|
||
- `tb_gs_demo_psmct16_swizzle_trxdir_e2e.sv` (Ch130) — companion
|
||
to Ch129: same EE-bootlet → DMAC → GIF data plane and same all-
|
||
three-gates-on instantiation, but the framebuffer is filled by
|
||
a TRXDIR/IMAGE upload through `gif_image_xfer_stub` instead of
|
||
by raster. The Ch127 image-xfer write-side swizzle gate becomes
|
||
LOAD-BEARING inside the demo flow — every byte the GS produces
|
||
comes out of the image-xfer engine at canonical PSMCT16
|
||
swizzled addresses, and the raster path is dormant. Payload:
|
||
U1 (PACKED, NREG=4: BITBLTBUF{DBP=0, DBW=1, DPSM=PSMCT16} /
|
||
TRXPOS{DSAX=DSAY=0} / TRXREG{RRW=16, RRH=8} / TRXDIR{XDIR=0})
|
||
+ U2 (IMAGE, NLOOP=16: 16 IMAGE qwords carrying the 128 PSMCT16
|
||
halfwords of the same four-quadrant pattern Ch129 used). DMAC
|
||
QWC = 22. Verification mirrors Ch129: (1) full 16×8 scanout
|
||
frame capture; (2) per-pixel halfword readback at the canonical
|
||
swizzled byte address (with `addr[1]` selecting the right 16-bit
|
||
slot) via vram_stub's 2nd read port; (3) strict linear-vs-
|
||
swizzled separators at bytes 256 and 384 stay 0; (4) per-emit
|
||
observer asserts every image-xfer write is `be=4'b0011` /
|
||
`mask=0xFFFF_FFFF` (low halfword) and the `trxdir_wr_q` arming
|
||
pulse fires exactly once. Aggregate counts: `dma=(1,22,1)
|
||
ee_dmac_wr=3 giftags=2 ad_writes=4 trxdir_arms=1
|
||
xfer_writes=128 ee_priv_wr=4 bridge_fires=4 core_halt=1
|
||
emits=0 frame=16x8`. Ch129 + Ch130 together exercise BOTH
|
||
PSMCT16 write-side paths (raster Ch128 + image-xfer Ch127)
|
||
end-to-end through the same driver-shaped flow with the same
|
||
swizzled-scanout (Ch126) read side — bringing PSMCT16 to
|
||
full parity with the PSMCT32 e2e coverage from Ch123 + Ch124.
|
||
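
The halfword readback step selects one of two 16-bit slots inside each 32-bit VRAM word using `addr[1]`. A minimal sketch of that selection follows. The `vram` dictionary and its contents are hypothetical stand-ins for the second read port, reusing the two halfword values from the Ch128 entry purely for illustration.

```python
def read_halfword(word, addr):
    """Select the PSMCT16 halfword at byte address addr from its 32-bit
    word: addr[1] = 0 takes bits [15:0], addr[1] = 1 takes bits [31:16]."""
    return (word >> (16 if addr & 2 else 0)) & 0xFFFF


def word_addr(addr):
    """Byte address of the 32-bit word containing addr."""
    return addr & ~3


vram = {0: 0x9F8E6155}  # hypothetical word holding two pixels
for a in (0, 2):
    print(a, hex(read_halfword(vram[word_addr(a)], a)))
```
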
- `tb_gs_demo_psmct16_swizzle_e2e.sv` (Ch129) — full driver-shaped
|
||
end-to-end demo with all three PSMCT16 swizzle gates
|
||
(Ch126 read-side pcrtc, Ch127 image-xfer write-side, Ch128
|
||
raster write-side) parameter-set to 1 simultaneously, but with
|
||
the demo flow exercising only the raster (Ch128) + scanout
|
||
(Ch126) paths as load-bearing. The Ch127 image-xfer gate is
|
||
smoke-only here (parameter is set but `xfer_writes_seen == 0`
|
||
is asserted, since no TRXDIR/IMAGE packet is delivered in the
|
||
raster-driven payload); Ch130 (TRXDIR-driven PSMCT16 e2e) is
|
||
the load-bearing image-xfer-side counterpart.
|
||
PSMCT16 counterpart of Ch123's PSMCT32 e2e demo. Same EE-bootlet → DMAC
|
||
→ GIF data plane: BIOS-resident EE program configures GS-priv
|
||
(DISPFB1 PSMCT16, DISPLAY1, PMODE) via `sw` instructions →
|
||
kicks DMAC ch2 → SYSCALL halts. DMAC delivers a 24-qword
|
||
payload (4 SPRITE PACKED packets) through `gif_packed_stub`
|
||
into `gs_stub` raster. The 4 sprites tile the 16×8 active area
|
||
into 4 quadrants with per-quadrant unique RGB5A1 colors picked
|
||
so the 5→8 bit-replicate at PCRTC output produces unique 8-bit
|
||
RGB triples. With the raster gate on, all 128 per-pixel
|
||
halfword stores go through `gs_swizzle_psmct16_stub`; with the
|
||
pcrtc gate on, scanout reads from those same swizzled
|
||
addresses. **Two-phase verification**: (1) full-frame scanout
|
||
asserts each (x, y) reads back its quadrant's 5→8-expanded
|
||
RGB; (2) per-pixel halfword readback via vram_stub's 2nd port
|
||
at swizzled addresses (with `addr[1]` selecting the right
|
||
16-bit slot) confirms each sprite halfword landed where the
|
||
swizzle says — the 16×8 PSMCT16 image lives entirely in block
|
||
(0,0) of page 0 (PSMCT16 block = 16×8 px), so the readback
|
||
exercises ALL 16 xb × 8 yb entries of `columnTable16`. Strict
|
||
linear-vs-swizzled separators at bytes 256 (linear y=2 row
|
||
start at PSMCT16 stride=128) and 384 (linear y=3) stay 0 —
|
||
both outside block (0,0)'s 256-byte range. Aggregate counts:
|
||
`dma=(1,24,1) ee_dmac_wr=3 giftags=4 ad_writes=20
|
||
xfer_writes=0 ee_priv_wr=4 bridge_fires=4 core_halt=1
|
||
emits=128 frame=16x8`. Together with Ch123 (PSMCT32 e2e),
|
||
this is the first state of the project where the full
|
||
driver-shaped flow has end-to-end byte-accuracy demos for
|
||
BOTH common direct-color GS PSMs (CT32 + CT16).
|
||
- `tb_gs_raster_swizzle_psmct16.sv` (Ch128) — focused contract for
|
||
the new `PSMCT16_SWIZZLE` parameter on `gs_stub` (the raster emit
|
||
surface). Mirrors Ch122's wiring shape but for PSMCT16: when the
|
||
parameter is 1 AND the active raster PSM is PSMCT16
|
||
(`ras_psm == 6'h02`), the per-pixel raster emit address is routed
|
||
through the Ch125 `gs_swizzle_psmct16_stub` (FBP=ras_fbp, FBW=
|
||
ras_fbw, x=s2_x_q[11:0], y=s2_y_q[11:0]) — its output is the
|
||
absolute byte address. PSMCT32 is gated by its own
|
||
`PSMCT32_SWIZZLE` parameter (Ch122). At Ch128 only, PSMT8/PSMT4
|
||
raster emits stayed linear; Ch134 later closed the PSMT8 raster
|
||
gate via `PSMT8_SWIZZLE` on this same `gs_stub`, and Ch140 closed
the PSMT4 raster gate via `PSMT4_SWIZZLE`.
|
||
Default 0 keeps every existing PSMCT16 raster TB (Ch95 etc.)
|
||
unchanged. **Three-phase verification**: (1) **origin SPRITE**
|
||
— drive a 16×4 PSMCT16 SPRITE at FRAME_1{FBP=0, FBW=1, PSMCT16}
|
||
with RGBAQ {R=0xAA, G=0x50, B=0xC0, A=0x00} → halfword 0x6155
|
||
(R5=0x15, G5=0x0A, B5=0x18, A1=0). Per-pixel halfword readback
|
||
via vram_stub's 2nd port (with `addr[1]` selecting the right
|
||
16-bit slot) confirms each lands at the swizzled byte. The
|
||
16×4 image lives in block (0,0) of page (0,0), so within-block
|
||
columnTable16 rows 0..3 are exercised. **Strict separators**:
|
||
bytes 128 (linear y=1 row start at PSMCT16 stride=128) and 256
|
||
(linear y=2) stay 0 — proves the gate is live, since a fall-
|
||
through to the legacy linear path would put the SPRITE
|
||
halfword there. (2) **scanout agreement** — enable the Ch126
|
||
swizzled-pcrtc path on the same VRAM contents, capture the
|
||
full 16×4 frame, assert each visible pixel reads back the
|
||
expected RGB after PCRTC's 5→8 bit-replicate (RGB = {0xAD,
|
||
0x52, 0xC6}). Both gs_stub (Ch128 raster) and gs_pcrtc_stub
|
||
(Ch126 scanout) instantiate the same swizzle module. (3)
|
||
**non-origin SPRITE** — re-arm with FRAME_1{FBP=4, FBW=2,
|
||
PSMCT16} and an 8×4 SPRITE at (60, 4)..(67, 7) with distinct
|
||
color (halfword 0x9F8E). Crosses the PAGE-x boundary at x=64
|
||
(page (0,0) for x∈[60..63] — block (0,3) by swizzle table —
|
||
vs page (1,0) for x∈[64..67] — block (0,0)) so page_index
|
||
changes mid-row. Within-block column-table coords (xb=12..15 then 0..3,
|
||
yb=4..7) cover columnTable16 rows 4..7 — a different region
|
||
than Phase 1's yb=0..3. Pins three contracts Phase 1 can't:
|
||
(a) `ras_fbp` reaches the swizzle's `fbp` input (FBP=0 in P1
|
||
would mask a tied-zero); (b) `ras_fbw` reaches `fbw` (FBW=1
|
||
in P1 would mask a tied-one); (c) the swizzle gets the FULL
|
||
absolute pixel coords s2_x_q/s2_y_q rather than bbox-local
|
||
(P1's sprite started at (0,0), so absolute and local were
|
||
equal). Strict P3 separator at byte 9336 (linear formula's
|
||
effective (60, 4) byte) stays 0 — outside the P3 swizzled
|
||
write set, which lives in block (0,3) of page (0,0)
|
||
(10914..11006) and block (0,0) of page (1,0) (16512..16604).
|
||
Total emit count after all phases: 64 + 32 = 96. With Ch126
|
||
(read), Ch127 (TRXDIR upload), and Ch128 (raster emit) all
|
||
live, the three major PSMCT16 paths are byte-consistent
|
||
end-to-end — completes the byte-accuracy milestone for the
|
||
second PSM, mirroring the Ch120/Ch121/Ch122 PSMCT32 closure.
|
||
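
The Phase 1 halfword and the Phase 2 scanout triple can be reproduced from the RGBAQ inputs. The sketch below assumes the raster truncates each 8-bit channel to its top five bits and takes the alpha bit from A[7]. The alpha rule is an assumption, since the text only shows the A=0 case.

```python
def pack_rgb5a1(r, g, b, a):
    """Truncate 8-bit RGBA to a PSMCT16 halfword {A1, B5, G5, R5}."""
    return ((a >> 7) << 15) | ((b >> 3) << 10) | ((g >> 3) << 5) | (r >> 3)


def expand5(v):
    """5-to-8 bit-replicate: the top three bits are copied into the low bits."""
    return (v << 3) | (v >> 2)


def unpack_rgb888(h):
    """Expand the R5, G5, B5 fields of a halfword to 8-bit channels."""
    return tuple(expand5((h >> s) & 0x1F) for s in (0, 5, 10))


h = pack_rgb5a1(0xAA, 0x50, 0xC0, 0x00)
print(hex(h), [hex(c) for c in unpack_rgb888(h)])
# 0x6155 ['0xad', '0x52', '0xc6']
```

Bit-replication maps distinct 5-bit values to distinct 8-bit values, which is why distinct RGB5A1 colors remain distinguishable at scanout.
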
- `tb_gs_image_xfer_swizzle_psmct16.sv` (Ch127) — focused contract
|
||
for the new `PSMCT16_SWIZZLE` parameter on `gif_image_xfer_stub`.
|
||
Mirrors Ch121's wiring shape but for PSMCT16: when the parameter
|
||
is 1 AND the upload's PSM is PSMCT16, per-pixel byte addresses
|
||
route through the Ch125 `gs_swizzle_psmct16_stub` (FBP=0,
|
||
FBW=DBW, x=DSAX+cur_x, y=DSAY+cur_y) and `dest_base_q
|
||
(= DBP*256)` is added back to anchor at the upload's DBP base.
|
||
PSMCT32 is gated by its own `PSMCT32_SWIZZLE` parameter (Ch121).
At Ch127 only, PSMT8/PSMT4 uploads stayed linear; Ch133 and Ch139
later closed those gates. Default 0 keeps every existing PSMCT16
|
||
image-xfer TB unchanged. **Three-phase verification**: (1)
|
||
**origin transfer** — TRXDIR upload of a 16×4 PSMCT16 image at
|
||
DBP=DSAX=DSAY=0, DBW=1, RRW=16, RRH=4 → 64 pixels, 8 IMAGE
|
||
qwords (8 px/qword for PSMCT16). After upload, the TB reads
|
||
vram_stub's 2nd port at the SWIZZLED byte address (TB-side
|
||
`ref_addr16/ref_block_idx16/ref_col_idx16` carry the verbatim
|
||
PCSX2 tables locked at Ch125) and asserts each halfword landed
|
||
where the swizzle says (selecting the right 16-bit slot inside
|
||
the 32-bit word via `addr[1]`). Strict linear-vs-swizzled
|
||
separators at bytes 128 (linear y=1) and 256 (linear y=2) stay
|
||
0 — swizzled writes for the 16×4 image fill only block (0,0)
|
||
bytes [0..126]. (2) **scanout agreement** — enable the Ch126
|
||
swizzled-pcrtc path on the same VRAM contents, capture the
|
||
full 16×4 frame, assert each scanned pixel matches the uploaded
|
||
RGB5A1 → RGB888 5→8 bit-replicate. Both upload and scanout
|
||
instantiate the same `gs_swizzle_psmct16_stub`. (3) **non-origin
|
||
transfer** — re-arm with DBP=8, DSAX=12, DSAY=6, RRW=8, RRH=4.
|
||
Effective coords (12..19, 6..9) cross block_x=0→1 at
|
||
effective_x=16 AND block_y=0→1 at effective_y=8, exercising
|
||
both block-table dimensions inside a single non-origin upload.
|
||
Pins three contracts the origin transfer can't distinguish from
|
||
a buggy implementation: (a) `dest_base_q (= DBP*256)` is added
|
||
on top of the swizzle output (DBP=0 in P1 would mask a
|
||
missing-add); (b) the swizzle is fed the FULL effective coords
|
||
(DSAX=DSAY=0 in P1 would mask a "feeds only cur_x/cur_y"
|
||
regression); (c) BOTH block_x and block_y propagate through
|
||
`blockTable16[by][bx]` (block_x=0 throughout P1 would mask a
|
||
tied-block_x regression). Strict P3 separator at byte 3096
|
||
(linear formula's effective (12, 8) byte) stays 0 — outside
|
||
the P3 swizzled write set [2048..3071]. NOTE (now historical):
|
||
PSMCT16 raster swizzle was deferred when Ch127 landed; it
|
||
shipped at Ch128 (mirrors Ch122 for PSMCT32) so the PSMCT16
|
||
raster path is now byte-consistent with the image-xfer path
|
||
documented here.
|
||
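
Contract (c) depends on the Phase 3 rectangle touching more than one block in each dimension, which follows directly from the effective coordinates and the 16×8 PSMCT16 block size. The sketch below enumerates the in-page block coordinates a rectangle covers. It does not apply `blockTable16`, and `blocks_touched` is a hypothetical helper.

```python
BLOCK_W, BLOCK_H = 16, 8  # PSMCT16 block size in pixels


def blocks_touched(dsax, dsay, rrw, rrh):
    """Set of (block_x, block_y) pairs covered by an upload rectangle,
    using effective coordinates DSAX + x_local, DSAY + y_local."""
    return {((dsax + x) // BLOCK_W, (dsay + y) // BLOCK_H)
            for y in range(rrh) for x in range(rrw)}


print(sorted(blocks_touched(0, 0, 16, 4)))   # [(0, 0)]
print(sorted(blocks_touched(12, 6, 8, 4)))   # [(0, 0), (0, 1), (1, 0), (1, 1)]
```

Phase 1 stays inside a single block, while Phase 3 covers all four combinations of block_x and block_y in {0, 1}, so a tied block_x or block_y input would change at least one address.
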
- `tb_gs_raster_swizzle_psmt4.sv` (Ch140) — focused contract for
|
||
the new `PSMT4_SWIZZLE` parameter on `gs_stub` (the raster emit
|
||
surface). Mirrors Ch122/Ch128/Ch134 wiring shape but for the
|
||
fourth (and last) PSM, and threads the Ch137 swizzle module's
|
||
`nibble_hi` output into the existing Ch106 PSMT4 raster nibble
|
||
RMW data lane (replacing `s2_pixel_index[0]` as the high/low
|
||
nibble selector when the gate is on). When the parameter is 1
|
||
AND the active raster PSM is PSMT4 (`ras_psm == 6'h14`), the
|
||
per-pixel raster emit address is routed through the Ch137
|
||
`gs_swizzle_psmt4_stub` (FBP=ras_fbp, FBW=ras_fbw,
|
||
x=s2_x_q[11:0], y=s2_y_q[11:0]) — its `addr` output is the
|
||
absolute byte address, AND its `nibble_hi` output keys
|
||
`s2_emit_color64`'s nibble placement and `s2_emit_mask`'s
|
||
high/low gating (write_be stays 4'b0001 for both paths).
|
||
PSMCT32/PSMCT16/PSMT8 are gated by their own parameters;
|
||
default 0 keeps every existing PSMT4 raster TB (Ch106
|
||
raster_psmt4, Ch107 PSMT4-e2e, Ch103 PSMT4+CLUT, Ch104 round-
|
||
trip, etc.) on the original linear addressing. No new ports.
|
||
Default-off smoke verification: ran Ch106 + Ch107 + Ch103 +
|
||
Ch104 PSMT4 TBs before writing the new TB; all PASSed
|
||
unchanged. **Three-phase verification** (mirrors Ch134 PSMT8
|
||
raster shape, with PSMT4 nibble adaptations + CLUT-disabled
|
||
grayscale at scanout):
|
||
(1) **origin SPRITE** at FBP=0/FBW=2 (FBW must be even per
|
||
PCSX2 GSLocalMemory.h:560 — same as PSMT8). Drive a 16×4 PSMT4
|
||
SPRITE with RGBAQ.R=0xAA (PSMT4 raster channel takes R[3:0] as
|
||
the nibble per Ch106 → nibble = 0xA). Per-pixel nibble readback
|
||
via vram_stub's 2nd port (with `addr[1:0]`-keyed byte
|
||
selection then `nibble_hi`-keyed nibble selection inside the
|
||
byte) confirms each pixel landed at the correct (byte, nibble)
|
||
slot. The image lives in the upper-left of block (0,0) of page
|
||
(0,0); within-block columnTable4 entries for yb=0..3, xb=0..15
|
||
cover nibble_idx values [0..127] → byte_in_block ∈ [0..63].
|
||
Strict separator: byte 64 (linear y=1 row start at PSMT4
|
||
FBW=2 stride 64) stays 0.
|
||
(2) **scanout agreement** — enable Ch138 swizzled-pcrtc on
|
||
the same VRAM, capture full 16×4 frame, assert each pixel
|
||
reads back as PSMT4 grayscale R=G=B={0xA, 0xA} = 0xAA. Both
|
||
gs_stub and gs_pcrtc_stub instantiate the same
|
||
`gs_swizzle_psmt4_stub` AND thread its `nibble_hi` output
|
||
through their respective nibble selectors — agreement at this
|
||
layer means both integrations land at the same byte+nibble
|
||
positions for PSMT4.
|
||
(3) **non-origin SPRITE** at FBP=4/FBW=4 (bw_pg=2) drawing
|
||
8×4 SPRITE at (124, 4)..(131, 7) with R=0x55 (nibble = 0x5).
|
||
Crosses PSMT4 PAGE-x at x=128 (page (0,0) for x∈[124..127],
|
||
page (1,0) for x∈[128..131]). 2 blocks visited:
|
||
blockTable4[0][3]=10 → page (0,0) block_base 10752;
|
||
blockTable4[0][0]=0 → page (1,0) block_base 16384. Pins three
|
||
contracts the origin transfer can't: ras_fbp reaches the
|
||
swizzle's fbp input; ras_fbw reaches fbw; the swizzle gets
|
||
the FULL absolute pixel coords s2_x_q/s2_y_q. Strict P3
|
||
separator at byte 8766 (linear (124, 4) at FBP=4/FBW=4) stays
|
||
0 — outside the P3 swizzled write set [10752..11007] +
|
||
[16384..16639]. Total emit count: 64 + 32 = 96. **First-
|
||
attempt PASS** errors=0.
|
||
With Ch138 (read-side), Ch139 (TRXDIR upload), and Ch140
|
||
(raster emit) all live, the three major PSMT4 paths can be
|
||
byte-consistent under the canonical swizzle when their gates
|
||
are flipped on — completing the **four-PSM × three-path
|
||
byte-accuracy foundation** (CT32 Ch120/Ch121/Ch122 + CT16
|
||
Ch126/Ch127/Ch128 + T8 Ch132/Ch133/Ch134 + T4 Ch138/Ch139/
|
||
Ch140). End-to-end PSMT4 swizzled demos (mirroring Ch123/
|
||
Ch124, Ch129/Ch130, Ch135/Ch136) are now possible.
|
||
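
The Phase 3 separator arithmetic for the PSMT4 raster path can be reproduced in a few lines. This is a minimal sketch, not the TB's code: the helper names are hypothetical, FBP is assumed to be in 2048-byte units, the legacy linear PSMT4 stride is assumed to be `FBW*32` bytes per row, and the two block indices are the `blockTable4` entries quoted in the entry above.

```python
PAGE_BYTES = 8192   # one GS page
BLOCK_BYTES = 256   # one GS block

def linear_psmt4_byte(fbp, fbw, x, y):
    """Legacy linear PSMT4 byte address: 2 pixels per byte, FBW*32 bytes per row."""
    return fbp * 2048 + y * (fbw * 32) + (x >> 1)

def block_range(fbp, page_index, block_idx):
    """Byte range covered by one swizzled block (hypothetical helper)."""
    base = fbp * 2048 + page_index * PAGE_BYTES + block_idx * BLOCK_BYTES
    return range(base, base + BLOCK_BYTES)

# Phase 3: FBP=4, FBW=4, SPRITE at (124, 4)..(131, 7).
left = block_range(4, 0, 10)   # blockTable4[0][3] = 10, page (0,0)
right = block_range(4, 1, 0)   # blockTable4[0][0] = 0,  page (1,0)
separator = linear_psmt4_byte(4, 4, 124, 4)

print(left[0], left[-1])       # 10752 11007
print(right[0], right[-1])     # 16384 16639
print(separator, separator in left or separator in right)  # 8766 False
```

The separator only proves the gate is live because it falls outside both swizzled ranges; a separator chosen inside either block would be overwritten on both paths.
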
- `tb_gs_raster_swizzle_psmt8.sv` (Ch134) — focused contract for
|
||
the new `PSMT8_SWIZZLE` parameter on `gs_stub` (the raster emit
|
||
surface). Mirrors Ch122's PSMCT32 + Ch128's PSMCT16 wiring shape
|
||
but for the third PSM: when the parameter is 1 AND the active
|
||
raster PSM is PSMT8 (`ras_psm == 6'h13`), the per-pixel raster
|
||
emit address is routed through the Ch131 `gs_swizzle_psmt8_stub`
|
||
(FBP=ras_fbp, FBW=ras_fbw, x=s2_x_q[11:0], y=s2_y_q[11:0]) —
|
||
its output is the absolute byte address. PSMCT32/PSMCT16 are
|
||
gated by their own parameters; PSMT4 stayed linear until Ch140. Default 0
|
||
keeps every existing PSMT8 raster TB (Ch105 raster_psmt8, Ch107
|
||
PSMT4-via-CT16-CLUT palette path, etc.) on the original linear
|
||
addressing. No new ports — parameter-only API change. Default-
|
||
off smoke verification: ran Ch105 `tb_gs_raster_psmt8` before
|
||
writing the new TB; PASSed unchanged. **Three-phase verification**
|
||
(mirrors Ch128 PSMCT16 raster shape):
|
||
(1) **origin SPRITE** at FBP=0/FBW=2 (FBW must be even: PCSX2
|
||
asserts `(bw & 1) == 0` for PSMT8). Drive a 16×8 PSMT8 SPRITE
|
||
with RGBAQ.R=0xA5 (PSMT8 raster channel uses R as the byte
|
||
index per Ch105). Per-pixel byte readback via vram_stub's 2nd
|
||
port confirms each lands at the swizzled byte. The 16×8 image
|
||
lives in the upper half of block (0,0) of page (0,0); the
|
||
within-block columnTable8 distributes the 128 bytes across yb
|
||
rows 0..7 — byte values 0..127 within the block. **Strict
|
||
separators**: bytes 128 (linear y=1 row start at PSMT8
|
||
stride=128) and 256 (linear y=2) stay 0 — proves the gate is
|
||
live, since a fall-through to the legacy linear path would put
|
||
the SPRITE byte there. (2) **scanout agreement** — enable the
|
||
Ch132 swizzled-pcrtc path on the same VRAM, capture the full
|
||
16×8 frame, assert each pixel's PCRTC PSMT8 grayscale R=G=B
|
||
matches `idx=0xA5`. Both gs_stub and gs_pcrtc_stub instantiate
|
||
the same `gs_swizzle_psmt8_stub`, so success proves byte-level
|
||
agreement. (3) **non-origin SPRITE** at FBP=4/FBW=4 (bw_pg=2)
|
||
drawing 8×4 SPRITE at (124, 4)..(131, 7) with RGBAQ.R=0x5A.
|
||
Crosses PSMT8 PAGE-x at x=128 (x∈[124..127] is in page (0,0)
|
||
block (0,7) by swizzle table; x∈[128..131] is in page (1,0)
|
||
block (0,0)) so page_index changes mid-row. Pins three
|
||
contracts the origin transfer can't: `ras_fbp` reaches the
|
||
swizzle's fbp input (FBP=0 in P1 would mask a tied-zero);
|
||
`ras_fbw` reaches fbw (FBW=2 would mask a tied-two); the
|
||
swizzle gets the FULL absolute pixel coords s2_x_q/s2_y_q
|
||
rather than bbox-local (P1 sprite started at (0,0) so
|
||
absolute=local). PSMT8's page-x boundary at x=128 is different
|
||
from CT32/CT16's x=64, so this exercises the PSMT8-specific
|
||
x[7] wiring of the swizzle. Strict P3 separator at byte 9340
|
||
(linear (124, 4) at FBP=4/FBW=4) stays 0 — outside the P3
|
||
swizzled write set (page (0,0) block (0,7) at base 13568, page
|
||
(1,0) block (0,0) at base 16384). Total emit count: 128 + 32 =
|
||
160. **First-attempt PASS** errors=0. With Ch132 (read-side),
|
||
Ch133 (TRXDIR upload), and Ch134 (raster emit) all live, the
|
||
three major PSMT8 paths can be byte-consistent under the
|
||
canonical swizzle when their gates are flipped on — completing
|
||
the third-PSM byte-accuracy milestone for ALL three integration
|
||
points (mirrors the Ch120/Ch121/Ch122 PSMCT32 trio + the
|
||
Ch126/Ch127/Ch128 PSMCT16 trio).
|
||
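
The gate described for the PSMT8 raster path is a two-condition select: the parameter must be set AND the active PSM must be PSMT8. The sketch below models that select in software. The function names are hypothetical, FBP is assumed to be in 2048-byte units, and the swizzled value 8 for pixel (0, 1) is the `columnTable8[1][0]=8` figure quoted in the Ch132 entry.

```python
PSMT8 = 0x13  # ras_psm encoding quoted in the entry above

def raster_emit_addr(psmt8_swizzle, ras_psm, linear_addr, swizzled_addr):
    """Pick the emit address: swizzled only when the gate is on AND the PSM matches."""
    use_swizzle = bool(psmt8_swizzle) and ras_psm == PSMT8
    return swizzled_addr if use_swizzle else linear_addr

def linear_psmt8_byte(fbp, fbw, x, y):
    """Legacy linear PSMT8 byte address: 1 byte per pixel, FBW*64 bytes per row."""
    return fbp * 2048 + y * (fbw * 64) + x

lin = linear_psmt8_byte(0, 2, 0, 1)   # Phase 1 pixel (0, 1), linear path
swz = 8                                # same pixel under the swizzle

print(raster_emit_addr(0, PSMT8, lin, swz))   # 128 (default-off keeps linear)
print(raster_emit_addr(1, PSMT8, lin, swz))   # 8   (gate on, PSM matches)
print(raster_emit_addr(1, 0x00, lin, swz))    # 128 (gate on, other PSM)
print(linear_psmt8_byte(4, 4, 124, 4))        # 9340 (Phase 3 separator)
```
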
- `tb_gs_image_xfer_swizzle_psmt4.sv` (Ch139) — focused contract
|
||
for the new `PSMT4_SWIZZLE` parameter on `gif_image_xfer_stub`.
|
||
Mirrors Ch121/Ch127/Ch133 wiring shape but for the fourth (and
|
||
last) PSM, and threads the Ch137 swizzle module's `nibble_hi`
|
||
output into the existing Ch118 nibble RMW data lane (replacing
|
||
`x_eff[0]` as the high/low nibble selector when the gate is
|
||
on). When the parameter is 1 AND the active DPSM is PSMT4, the
|
||
per-pixel byte address is `dest_base_q (= DBP*256) +
|
||
swizzle_psmt4(FBP=0, FBW=DBW, x=DSAX+cur_x, y=DSAY+cur_y).addr`,
|
||
AND `cur_mask_c` is `0x0000_00F0` when `swizzle4_nibble_hi=1`
|
||
(high nibble) or `0x0000_000F` when 0 (low nibble) — the
|
||
per-bit write_mask machinery (vram_stub merges only the
|
||
targeted nibble) layers on top of the swizzled address. PSMCT32
|
||
/PSMCT16/PSMT8 are gated by their own parameters. Default 0
|
||
keeps the legacy linear path for every existing PSMT4 image-
|
||
xfer TB (Ch118 etc.). No new ports — parameter-only API
|
||
change. Default-off smoke verification: ran Ch118
|
||
`tb_gs_image_xfer_psmt4` before writing the new TB; PASSed
|
||
unchanged. **Three-phase verification** (mirrors Ch127/Ch133
|
||
audit-closed shape): (1) **origin write-side lock** at DBP=0/
|
||
DBW=2/DSAX=DSAY=0 (DBW must be even per PCSX2 GSLocalMemory.h:
|
||
560 — same FBW-evenness as PSMT8). 16×4 PSMT4 image upload via
|
||
2 IMAGE qwords (32 px/qword for PSMT4; 64 px = 4 rows × 16 px at
|
||
RRW=16). After upload, per-pixel nibble readback at the
|
||
swizzled `(addr, nibble_hi)` slot asserts each nibble landed
|
||
where the swizzle says. Strict separator: PSMT4 row stride at
|
||
DBW=2 = DBW*32 = 64 bytes, so linear y=1 starts at byte 64.
|
||
Swizzled write set lives in [0..63] within block (0,0). Byte
|
||
64 stays 0 (verified via per-byte check, not full-word — the
|
||
`check_byte_zero` task initially had a full-word bug that
|
||
misreported neighbor-byte writes; fixed to check only the
|
||
targeted byte via `addr[1:0]`-keyed case statement).
|
||
(2) **end-to-end agreement**: enable Ch138 PSMT4 swizzled
|
||
scanout on the same VRAM (PSMT4_SWIZZLE=1 on pcrtc, CLUT
|
||
disabled), capture the 16×4 frame, verify each pixel's grayscale
|
||
R=G=B={nibble, nibble} matches `nibble_at(xx, yy)`. Both
|
||
modules instantiate the same `gs_swizzle_psmt4_stub` so success
|
||
proves byte+nibble-level agreement under TRXDIR-style emit +
|
||
scanout-style read.
|
||
(3) **non-origin transfer** at DBP=8/DBW=2/DSAX=28/DSAY=12/
|
||
RRW=8/RRH=8. Effective coords (28..35, 12..19) cross block_x=
|
||
0→1 at effective_x=32 AND block_y=0→1 at effective_y=16 (PSMT4
|
||
block geometry: 32×16 px). All 4 corner blocks of page (0,0)
|
||
at DBP=8 visited: blockTable4[0][0]=0, [0][1]=2, [1][0]=1,
|
||
[1][1]=3 (block bases 2048/2560/2304/2816). Pins three
|
||
contracts the origin transfer can't: dest_base_q ADDED ON TOP
|
||
of the swizzle output (DBP=0 in P1 would mask a missing-add
|
||
regression — fixed during bring-up after the TB initially
|
||
passed P3_DBP directly to ref_pos_psmt4 instead of using
|
||
fbp_v=0 + adding DBP*256); FULL effective coords; BOTH
|
||
block_x and block_y propagate through `blockTable4[by][bx]`.
|
||
Phase 3 strict separator: linear formula puts effective coord
|
||
(28, 12) at byte 2830 — under linear, the neighboring pixel
|
||
(29, 12) writes high nibble = 1 to that byte. Under swizzled,
|
||
no Phase-3 pixel hits byte 2830 (cross-checked: col_idx_psmt4
|
||
for the 4-block × 16-pixel coord set never produces nibble_idx
|
||
28 or 29). Byte 2830 stays 0 → fall-through to linear would
|
||
have stomped it with 0x10. **PASS** errors=0 after two bug-fix
|
||
iterations: (a) ref_pos_psmt4(P3_DBP, ...) was wrong — engine
|
||
feeds FBP=0 to the swizzle and adds DBP*256 separately, so TB
|
||
must do the same; (b) check_byte_zero tested the full word
|
||
instead of the targeted byte, producing false failures when a
|
||
neighbor byte in the same word was independently touched.
|
||
Counts: arms=2, writes=128 (P1 64 + P3 64). With Ch138 (read-
|
||
side scanout) + Ch139 (image-xfer write-side) + Ch140 (raster
|
||
write-side) all live, the Ch137 PSMT4 primitive now has all 3
|
||
integration points wired, and Ch141 closes the e2e demo.
|
||
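
Two details in the PSMT4 image-xfer entry are easy to get wrong: the nibble mask keyed by `nibble_hi`, and the per-byte zero check that replaced the full-word check. A minimal software model follows. The function names are hypothetical, and the little-endian byte-lane order (lane selected by `addr[1:0]`) is an assumption about the VRAM word layout.

```python
def nibble_mask(nibble_hi):
    """write_mask for one PSMT4 emit: 0xF0 for the high nibble, 0x0F for the low."""
    return 0x000000F0 if nibble_hi else 0x0000000F

def merge_nibble(old_byte, nibble, nibble_hi):
    """Per-bit merge of one nibble into a byte, preserving the other nibble."""
    mask = nibble_mask(nibble_hi) & 0xFF
    data = (nibble << 4) if nibble_hi else nibble
    return (old_byte & ~mask & 0xFF) | (data & mask)

def byte_of_word(word, addr):
    """Byte selected by addr[1:0] from a 32-bit word (little-endian lanes assumed)."""
    return (word >> (8 * (addr & 3))) & 0xFF

def byte_is_zero(word, addr):
    """Per-byte check; comparing the whole word to 0 misreports neighbour bytes."""
    return byte_of_word(word, addr) == 0

print(hex(merge_nibble(0xA5, 0x1, True)))    # 0x15
print(hex(merge_nibble(0xA5, 0x3, False)))   # 0xa3

word = 0x00001200                 # a neighbour byte in the same word was written
print(word == 0)                  # False (the full-word check would fail here)
print(byte_is_zero(word, 2830))   # True  (byte 2830 itself is untouched)
```
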
- `tb_gs_image_xfer_swizzle_psmt8.sv` (Ch133) — focused contract
|
||
for the new `PSMT8_SWIZZLE` parameter on `gif_image_xfer_stub`.
|
||
Mirrors Ch121's PSMCT32 + Ch127's PSMCT16 wiring shape but for
|
||
the third PSM: when the parameter is 1 AND the active DPSM is
|
||
PSMT8, the per-pixel byte address is `dest_base_q (= DBP*256) +
|
||
swizzle_psmt8(FBP=0, FBW=DBW, x=DSAX+cur_x, y=DSAY+cur_y)`.
|
||
PSMCT32/PSMCT16 are gated by their own parameters; PSMT4 stays
|
||
linear as of Ch133 (its swizzle gate arrived in Ch139). Default 0 keeps the legacy
|
||
linear path for every existing PSMT8 image-xfer TB (Ch117 etc.).
|
||
No new ports — parameter-only API change. Default-off smoke
|
||
verification: ran Ch117 `tb_gs_image_xfer_psmt8` before writing
|
||
the new TB; PASSed unchanged. **Three-phase verification**
|
||
(mirrors Ch127 audit-closed shape):
|
||
(1) **origin write-side lock** at DBP=0/DBW=2 (DBW must be even
|
||
per PCSX2 GSLocalMemory.h:553 — PSMT8 pages are 128 px wide vs
|
||
FBW's 64-px units, so 2 FBW units per page → bw_pg=1 here).
|
||
16×8 PSMT8 image upload via 8 IMAGE qwords (16 px/qword). Per-
|
||
pixel index `idx_at(x, y) = (y[2:0] << 4) | x[3:0]` ∈
|
||
[0x00..0x7F]. After upload, byte-readback at the swizzled
|
||
address asserts each byte landed where the swizzle says. Strict
|
||
separators: linear y=1 (byte 128) and y=2 (byte 256) row starts
|
||
stay 0 — swizzled write set lives entirely in [0..127].
|
||
(2) **end-to-end agreement**: enable Ch132 swizzled scanout on
|
||
the same VRAM, capture the frame, verify each visible pixel's
|
||
PCRTC PSMT8 grayscale R=G=B matches `idx_at(x, y)`. Both modules
|
||
instantiate the same `gs_swizzle_psmt8_stub` so success proves
|
||
byte-level agreement under TRXDIR-style emit + scanout-style
|
||
read. (3) **non-origin transfer** at DBP=8/DBW=2/DSAX=12/DSAY=10/
|
||
RRW=8/RRH=8. Effective coords (12..19, 10..17) cross block_x=0→1
|
||
at effective_x=16 AND block_y=0→1 at effective_y=16, so all 4
|
||
corner blocks of page (0,0) at DBP=8 (blockTable8[0][0]=0,
|
||
[0][1]=1, [1][0]=2, [1][1]=3 → block bases 2048/2304/2560/2816)
|
||
are visited. Pins three contracts the origin transfer can't:
|
||
`dest_base_q = DBP*256` ADDED ON TOP; the swizzle is fed FULL
|
||
effective coords (DSAX/DSAY non-zero); BOTH block_x and block_y
|
||
propagate through `blockTable8[by][bx]`. Phase 3 distinct-pixel
|
||
pattern uses `p3_idx = 0x80 | idx` ∈ [0x80..0xFF] (disjoint
|
||
from Phase 1's [0x00..0x7F]) so a P3 pixel landing at a P1
|
||
byte (or vice versa) surfaces as wrong RGB. Phase 3 strict
|
||
separator: linear formula puts effective coord (12, 10) at
|
||
byte `2048 + 10*128 + 12 = 3340` (outside swizzled set
|
||
[2048..3071]); byte 3340 stays 0 — proves a fall-through to
|
||
linear would have stomped that byte. **First-attempt PASS**:
|
||
arms=2, writes=192 (=128+64), errors=0. NOTE: at Ch133 only,
|
||
PSMT8 raster-side emits via `gs_stub` still used linear
|
||
addressing — Ch133 was image-xfer write-side only. Ch134 later
|
||
closed the raster-side gate via `PSMT8_SWIZZLE` on `gs_stub`
|
||
(mirrors Ch122 for PSMCT32 and Ch128 for PSMCT16) — see Ch134
|
||
row above.
|
||
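
The pattern and address bookkeeping for the PSMT8 image-xfer phases can be checked numerically. This is a sketch under stated assumptions: the function names are hypothetical, and Phase 3 is assumed to apply `idx_at` to transfer-local coordinates (the text does not say whether local or effective coordinates are used; the disjointness result holds either way).

```python
def idx_at(x, y):
    """Phase 1 index pattern: (y[2:0] << 4) | x[3:0]."""
    return ((y & 0x7) << 4) | (x & 0xF)

def p3_idx(x, y):
    """Phase 3 pattern: the Phase 1 index with bit 7 forced on."""
    return 0x80 | idx_at(x, y)

p1 = {idx_at(x, y) for y in range(8) for x in range(16)}    # 16x8 upload
p3 = {p3_idx(x, y) for y in range(8) for x in range(8)}     # 8x8 upload

print(len(p1), hex(min(p1)), hex(max(p1)))   # 128 0x0 0x7f
print(p1.isdisjoint(p3))                     # True

qwords = (16 * 8) // 16                      # 16 PSMT8 pixels per IMAGE qword
dest_base = 8 * 256                          # DBP=8 in 256-byte units
bases = [dest_base + blk * 256 for blk in (0, 1, 2, 3)]
linear_sep = dest_base + 10 * (2 * 64) + 12  # DBW=2 gives 128 bytes per row

print(qwords)                                   # 8
print(bases)                                    # [2048, 2304, 2560, 2816]
print(linear_sep, 2048 <= linear_sep <= 3071)   # 3340 False
```
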
- `tb_gs_scanout_swizzle_psmt4.sv` (Ch138) — focused contract for
|
||
the new `PSMT4_SWIZZLE` parameter on `gs_pcrtc_stub`. Mirrors
|
||
Ch120/Ch126/Ch132's read-side-first wiring shape but adds the
|
||
PSMT4-specific twist: the swizzle module outputs both an
|
||
absolute byte address AND a `nibble_hi` selector (PSMT4 = 4
|
||
bits/pixel = 2 pixels per byte, and the canonical PCSX2 column
|
||
table reorders nibbles within a block, so `pixel_index[0]`
|
||
is no longer the right selector under the swizzled layout).
|
||
When the parameter is 1 AND the active PSM is PSMT4, scanout
|
||
reads go through the Ch137 `gs_swizzle_psmt4_stub` and the
|
||
PSMT4 nibble extractor uses `swizzle4_nibble_hi` instead of
|
||
`pixel_index[0]`. PSMCT32/PSMCT16/PSMT8 are gated by their own
|
||
parameters; default 0 keeps every existing PSMT4 scanout TB
|
||
(Ch103 PSMT4+CLUT, Ch104 PSMT4 round-trip, Ch107 PSMT4 e2e,
|
||
etc.) on the legacy linear path. No new ports — parameter-
|
||
only API change. Default-off smoke verification: ran Ch103
|
||
`tb_gs_scanout_psmt4_clut` + Ch104 `tb_gs_psmt4_round_trip`
|
||
before writing the new TB; both PASSed unchanged. **Two-phase
|
||
verification** (mirrors Ch132 closure shape; CLUT disabled so
|
||
PCRTC's PSMT4 grayscale fallback gives `r=g=b={nibble,
|
||
nibble}` at scanout):
|
||
(1) **origin** at FBP=0/FBW=2/DBX=DBY=0 (FBW must be even per
|
||
PCSX2 GSLocalMemory.h:560 because PSMT4 pages are 128 px wide,
|
||
same as PSMT8). 16×4 region preloaded at swizzled bytes via a
|
||
TB-side `byte_shadow` accumulator that lays each pixel's
|
||
nibble at its `(addr, nibble_hi)` slot; bytes are then flushed
|
||
to vram_stub via per-byte BE writes. Per-pixel nibble pattern
|
||
`nibble_at(x, y) = ((y << 1) ^ x) & 4'h7` ∈ [0..7] gives varying
|
||
gray values (8 distinct levels) across the 16×4 frame. The image lives entirely
|
||
in block (0,0) of page (0,0) and exercises within-block
|
||
columnTable4 entries for yb=0..3, xb=0..15. Strict separator:
|
||
byte 64 (linear y=1 row start at FBW=2 stride) pre-colored
|
||
with sentinel 0xCC (gray=0xCC, unproducible by Phase 1's
|
||
[0..7]-nibble pattern) — fall-through to linear would surface
|
||
as RGB(0xCC, 0xCC, 0xCC).
|
||
(2) **non-origin** at FBP=4/FBW=4 (bw_pg=2), DBX=120, DBY=126.
|
||
Effective coords range x∈[120..135], y∈[126..129]. page_x
|
||
crosses 0→1 at effective_x=128, page_y crosses 0→1 at
|
||
effective_y=128 (PSMT4's 128-tall page boundary — different
|
||
from PSMT8's 64-tall). All 4 corner pages of FBP=4/FBW=4
|
||
visited, each with a distinct blockTable4 lookup
|
||
(blockTable4[7][3]=31 → page (0,0) block_base 16128;
|
||
blockTable4[7][0]=21 → page (1,0) block_base 21760;
|
||
blockTable4[0][3]=10 → page (0,1) block_base 27136;
|
||
blockTable4[0][0]=0 → page (1,1) block_base 32768). A
|
||
regression that tied any of {dispfb_fbp, dbx, dby, FBW,
|
||
block_x, block_y, page_index, bw_pg=FBW/2, swizzle
|
||
nibble_hi} to zero would NOT survive Phase 2. Strict P2
|
||
separator: byte 24380 (linear formula's place for (120, 126);
|
||
outside all 4 swizzled chunks) pre-colored with sentinel 0xDD
|
||
→ fall-through to linear would surface as RGB(0xDD, 0xDD,
|
||
0xDD), unproducible by the Phase-2 pattern. **PASS** errors=0
|
||
after one bug-fix iteration: Phase 2's flush-loop initially
|
||
hardcoded the wrong byte ranges due to a `blockTable4[7][3]`
|
||
lookup mistake (the value is 31, not 15) — replaced with a
|
||
shadow-array sweep [256..65535] that flushes any non-zero
|
||
byte, eliminating the hardcode/lookup mismatch class entirely.
|
||
NOTE (now historical): Ch138 was read-side only when
|
||
introduced; the PSMT4 write-side is now live as well — Ch139
|
||
(image-xfer) + Ch140 (raster) + Ch141 (raster-driven e2e
|
||
demo). With Ch138, **all four common GS PSMs now have read-
|
||
side byte-accuracy under their swizzle gates** (CT32 Ch120 +
|
||
CT16 Ch126 + T8 Ch132 + T4 Ch138).
|
||
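
The TB-side `byte_shadow` idea (accumulate nibbles per byte, then flush only touched bytes) and the sentinel reasoning can be sketched as follows. The class and method names are hypothetical, and the `(addr, nibble_hi)` slots used in the example are placeholders rather than real swizzle outputs.

```python
def nibble_at(x, y):
    """Phase 1 nibble pattern quoted above."""
    return ((y << 1) ^ x) & 0x7

def gray(nibble):
    """PSMT4 grayscale fallback: the nibble replicated into both halves of the byte."""
    return (nibble << 4) | nibble

class ByteShadow:
    """Accumulates nibbles at (addr, nibble_hi) slots before flushing whole bytes."""
    def __init__(self):
        self.bytes = {}

    def put(self, addr, nibble_hi, nibble):
        old = self.bytes.get(addr, 0)
        if nibble_hi:
            self.bytes[addr] = (old & 0x0F) | ((nibble & 0xF) << 4)
        else:
            self.bytes[addr] = (old & 0xF0) | (nibble & 0xF)

    def flush(self):
        """Sweep-style flush: every non-zero byte, no hardcoded address ranges."""
        return sorted((a, b) for a, b in self.bytes.items() if b != 0)

shadow = ByteShadow()
shadow.put(0, False, 0x3)   # placeholder slot: low nibble of byte 0
shadow.put(0, True, 0x5)    # placeholder slot: high nibble of byte 0
print(shadow.flush())       # [(0, 83)]  i.e. byte 0 = 0x53

levels = {gray(nibble_at(x, y)) for y in range(4) for x in range(16)}
print(len(levels))          # 8
print(0xCC in levels)       # False (the sentinel cannot be produced by Phase 1)
```

Sweeping the shadow for non-zero bytes avoids the class of bug where hand-derived flush ranges disagree with the table lookup.
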
- `tb_gs_scanout_swizzle_psmt8.sv` (Ch132) — focused contract for
|
||
the new `PSMT8_SWIZZLE` parameter on `gs_pcrtc_stub`. Mirrors
|
||
Ch120/Ch126's wiring shape but for PSMT8: when the parameter is
|
||
1 AND the active PSM is PSMT8, scanout reads go through the
|
||
Ch131 `gs_swizzle_psmt8_stub` (real PS2 GS page/block/column
|
||
layout — 128×64 pixel pages, 4×8 block grid, 16×16 within-block
|
||
bytes, `bw_pg = FBW>>1`) instead of the legacy linear
|
||
`FBW*64*y + x` formula. PSMCT32/PSMCT16 are gated by their own
|
||
parameters; PSMT4 stays linear as of Ch132 (its swizzle gate arrived in Ch138).
|
||
Default PSMT8_SWIZZLE=0 keeps every existing PSMT8 scanout TB
|
||
(Ch96 storage-only, Ch97 PSMT8+CLUT, Ch103 PSMT4-via-CT16-CLUT,
|
||
Ch107 PSMT4-e2e palette path) on the original linear addressing.
|
||
No new ports — parameter-only API change. Default-off smoke
|
||
verification: ran Ch96 `tb_gs_scanout_psmt8` before writing the
|
||
new TB; PASSed unchanged, confirming the new instance + 4-way
|
||
mux extension don't disturb the linear path. **Two-phase
|
||
verification** (mirrors Ch126 PSMCT16 closure shape):
|
||
(1) **origin** (FBP=0, FBW=2, DBX=DBY=0; FBW must be even —
|
||
PCSX2 asserts `(bw & 1) == 0` for PSMT8 because pages are 128 px
|
||
wide vs FBW's 64-px units, so 2 FBW units per page → bw_pg=1
|
||
here). 16×8 region preloaded at swizzled bytes; per-pixel index
|
||
`idx = (y[2:0] << 4) | x[3:0]` ∈ [0x00..0x7F] surfaces as
|
||
grayscale R=G=B=idx via PCRTC's PSMT8 fallback (Ch96). x∈[0..15]
|
||
is entirely block_x_in_page=0, so the within-block test
|
||
exercises ALL 16 xb positions of `columnTable8` across yb rows
|
||
0..7. Strict separators: linear y=1 starts at byte 128 (FBW=2
|
||
stride) but swizzled lands at byte 8 (`columnTable8[1][0]=8`,
|
||
no `*2` scale since PSMT8 is 1 byte/pixel); linear x=8,y=0 is
|
||
byte 8 but swizzled is byte 2. (2) **non-origin** (FBP=4,
|
||
FBW=4 → bw_pg=2, DBX=120, DBY=60). Effective coords range
|
||
x∈[120..135], y∈[60..67] — page_x crosses 0→1 at effective_x=128
|
||
(proves x[7] reaches the page-x lane of the PSMT8 swizzle —
|
||
different boundary from CT16/CT32's x[6]); page_y crosses 0→1
|
||
at effective_y=64; block_x and block_y both flip; ALL 4 pages
|
||
(0,0)/(1,0)/(0,1)/(1,1) are visited, each with a distinct
|
||
blockTable8 lookup ([3][7]=31, [3][0]=10, [0][7]=21, [0][0]=0).
|
||
A regression that tied any of {dispfb_fbp, dbx, dby, FBW,
|
||
block_x, block_y, page_index, bw_pg=FBW/2} to zero would NOT
|
||
survive Phase 2. **Sentinel separator**: byte 24500 (inside
|
||
linear range 23672..25479 for the Phase-2 effective region,
|
||
outside ALL 4 swizzled write-set blocks) pre-colored with 0xFF
|
||
→ fall-through to linear would surface as RGB(0xFF, 0xFF, 0xFF),
|
||
which is unproducible by the Phase-2 unique pattern (idx ∈
|
||
[0x00..0x7F]). **First-attempt PASS** errors=0 — no audit
|
||
iteration needed because Phase 2's coord choices were designed
|
||
up front to make all 7 chain-layer wires load-bearing AND the
|
||
page-x crossing boundary is at PSMT8's specific x=128 (not the
|
||
64-px boundary the direct-color PSMs use). NOTE (now historical):
|
||
Ch132 was read-side only when introduced; Ch133 then Ch134
|
||
closed the image-xfer + raster write sides for PSMT8, so all
|
||
three PSMT8 swizzle integration points are now live (mirrors
|
||
Ch120/Ch121/Ch122 for PSMCT32 and Ch126/Ch127/Ch128 for PSMCT16).
|
||
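
The Phase 2 geometry for the PSMT8 scanout test can be walked through in software. This sketch assumes FBP in 2048-byte units and uses only the four `blockTable8` entries quoted in the entry above; the helper names are hypothetical, and the full table is deliberately not reproduced here.

```python
PAGE_W, PAGE_H = 128, 64   # PSMT8 page in pixels
BLK_W, BLK_H = 16, 16      # PSMT8 block in pixels

# Only the blockTable8 entries quoted in the text, keyed by (block_y, block_x).
BLOCK_TABLE8_QUOTED = {(3, 7): 31, (3, 0): 10, (0, 7): 21, (0, 0): 0}

def decompose(fbw, x, y):
    """Split effective coords into (page_index, block_y, block_x)."""
    bw_pg = fbw >> 1                      # pages per row; FBW must be even
    page_x, page_y = x // PAGE_W, y // PAGE_H
    block_x = (x % PAGE_W) // BLK_W
    block_y = (y % PAGE_H) // BLK_H
    return page_y * bw_pg + page_x, block_y, block_x

def block_base(fbp, fbw, x, y):
    page_index, by, bx = decompose(fbw, x, y)
    return fbp * 2048 + page_index * 8192 + BLOCK_TABLE8_QUOTED[(by, bx)] * 256

def linear_psmt8_byte(fbp, fbw, x, y):
    return fbp * 2048 + y * (fbw * 64) + x

corners = [(120, 60), (135, 60), (120, 67), (135, 67)]
bases = [block_base(4, 4, x, y) for x, y in corners]
print(bases)   # [16128, 18944, 29952, 32768]

lo = linear_psmt8_byte(4, 4, 120, 60)
hi = linear_psmt8_byte(4, 4, 135, 67)
print(lo, hi)  # 23672 25479

sentinel = 24500
print(lo <= sentinel <= hi)                            # True
print(any(b <= sentinel < b + 256 for b in bases))     # False
```
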
- `tb_gs_scanout_swizzle_psmct16.sv` (Ch126) — focused contract
|
||
for the new `PSMCT16_SWIZZLE` parameter on `gs_pcrtc_stub`.
|
||
Mirrors Ch120's wiring shape but for PSMCT16: when the
|
||
parameter is 1 AND the active PSM is PSMCT16, scanout reads
|
||
go through the Ch125 `gs_swizzle_psmct16_stub` (real PS2 GS
|
||
page/block/column layout) instead of the legacy linear
|
||
`(FBW*64*y + x)*2` formula. PSMCT32 is gated by its own
|
||
`PSMCT32_SWIZZLE` parameter (Ch120); PSMT8/PSMT4 stay linear as of Ch126.
|
||
Default 0 keeps every existing PSMCT16 scanout TB
|
||
(Ch94/Ch95/Ch103/etc.) on the original linear addressing.
|
||
Topology: TB drives `vram_stub.write_*` directly with each
|
||
pixel's RGB5A1 halfword preloaded at the swizzled byte address
|
||
(TB-side `ref_addr16()` mirrors the swizzle math + the Ch125
|
||
source-table-locked tables); pcrtc with `PSMCT16_SWIZZLE=1`
|
||
scans out the 16×8 frame and the TB asserts each captured
|
||
pixel matches the preloaded color after 5→8 bit-replicate.
|
||
Per-pixel pattern is unique per (x, y): R5=`(x^y)&0xF`,
|
||
G5=`x&0xF`, B5=`y&0xF`, expanded to 8 bits via PCRTC's
|
||
bit-replicate. The PSMCT16 swizzle vs. linear distinction
|
||
shows up at any y>0 (linear y=1 → byte 128 with FBW=1, but
|
||
swizzled within block (0,0) yb=1 → columnTable16[1][0]=4
|
||
→ byte 8) and at x=8, y=0 (linear byte 16 vs swizzled byte 2)
|
||
so even within the first row + first block, the gate is a
|
||
strict separator. NOTE (now historical): Ch126 was read-side
|
||
only when introduced; Ch127 (image-xfer) then Ch128 (raster)
|
||
closed the PSMCT16 write sides, mirroring Ch121/Ch122 for
|
||
PSMCT32.
|
||
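
The PSMCT16 test pattern and the 5→8 expansion can be modelled as below. Two things here are assumptions rather than statements from the text: the RGB5A1 bit layout (R in bits 4:0, G in 9:5, B in 14:10, A in bit 15) and the bit-replicate formula `(c << 3) | (c >> 2)`. The function names are hypothetical.

```python
def pack_rgb5a1(r5, g5, b5, a1=0):
    """Pack a PSMCT16 halfword (assumed layout: A[15] B[14:10] G[9:5] R[4:0])."""
    return (a1 << 15) | (b5 << 10) | (g5 << 5) | r5

def expand5(c5):
    """5-bit to 8-bit bit-replicate (assumed formula)."""
    return (c5 << 3) | (c5 >> 2)

def pattern(x, y):
    """Per-pixel pattern quoted above: R5=(x^y)&0xF, G5=x&0xF, B5=y&0xF."""
    return (x ^ y) & 0xF, x & 0xF, y & 0xF

def linear_psmct16_byte(fbw, x, y):
    """Legacy linear byte address at FBP=0: 2 bytes per pixel, FBW*64 pixels per row."""
    return (fbw * 64 * y + x) * 2

halfwords = {pack_rgb5a1(*pattern(x, y)) for y in range(8) for x in range(16)}
print(len(halfwords))                  # 128 (unique per pixel in the 16x8 frame)
print(expand5(0x0F), expand5(0x1F))    # 123 255
print(linear_psmct16_byte(1, 0, 1))    # 128 (swizzled: byte 8)
print(linear_psmct16_byte(1, 8, 0))    # 16  (swizzled: byte 2)
```
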
- `tb_gs_swizzle_psmt4.sv` (Ch137) — focused contract for the new
|
||
`gs_swizzle_psmt4_stub` math primitive: a pure-comb module mapping
|
||
`(FBP, FBW, x, y)` to a VRAM **byte address + nibble_hi selector**
|
||
using the real PS2 GS PSMT4 layout (8 KiB pages organized as
|
||
128×128 PSMT4 pixels, 2× as many pixels per page as PSMT8 since
|
||
each PSMT4 pixel is a NIBBLE; 32 blocks/page in an 8-rows × 4-cols
|
||
grid (same orientation as PSMCT16's blockTable16); each block
|
||
32×16 pixels = 512 nibbles = 256 bytes; **512-entry within-block
|
||
column table** — 2× the entries of PSMT8's 256-entry table due to
|
||
the doubled block area, indexed [yb][xb] with yb=0..15 + xb=0..31
|
||
→ nibble 0..511). PSMT4 is the most complex of the four common GS
|
||
PSMs because each pixel is HALF a byte, so the swizzle outputs
|
||
both a byte address and a `nibble_hi` selector (=0 for low
|
||
nibble of the byte at `addr`, =1 for high). PSMT4 reuses PSMT8's
|
||
page-stride convention (`bw_pg = FBW >> 1`; PCSX2 asserts FBW
|
||
must be even at GSLocalMemory.h:560 because PSMT4 pages are 128
|
||
px wide). Source-table provenance pinned: `blockTable4` taken
|
||
verbatim from pcsx2/GS/GSTables.cpp lines 61–69; `columnTable4`
|
||
from same file lines 147–213. Master HEAD commit
|
||
`3000e113e2b3a76357c08dfa80d3c747f40e2706`; file blob SHA
|
||
`3581209b8217378f473f9de22a9dbc8c45ca49b6` (same blob Ch131
|
||
pinned). Cross-checked against GSLocalMemory.h:558
|
||
`BlockNumber4` + the `pxOffset` template at GSTables.cpp:247–258
|
||
(blockSize=512 in NIBBLE units, pageSize=16384 nibble units =
|
||
8192 bytes, pageWidth=128). The existing per-bit write_mask
|
||
0x0F/0xF0 nibble RMW from Ch106/Ch118 will still apply on top
|
||
of the swizzled byte address — the swizzle module doesn't touch
|
||
the nibble merge logic; it just produces (addr, nibble_hi).
|
||
**Five-phase verification** (mirrors Ch125/Ch131 shape, scaled
|
||
up): (1) **spot-checks** at 15 hand-computed corners (origin,
|
||
intra-block xb=1/8/16/yb=1/yb=2-with-hi-nibble, last nibble of
|
||
block (0,0), first/second/third/fourth horizontal blocks,
|
||
second-row-of-blocks origin, page-x at x=128 + page-y at y=128,
|
||
FBP=4 origin, page0-last-pixel (127,127) → addr 8191 hi=1).
|
||
(2a) **INDEPENDENT column-table source lock** — 32 hard-coded
|
||
`check_nibble()` calls for yb=0 (literal-by-literal verbatim
|
||
from PCSX2 columnTable4 row 0) PLUS a programmatic walk for
|
||
yb=1..15 against the in-TB ref function (480 more checks);
|
||
Phase 2a's literal yb=0 row + Phase 5's bijectivity sweep +
|
||
Phase 3's literal block-table lock together pin the table.
|
||
(3) **INDEPENDENT block-table source lock** — 32 hard-coded
|
||
checks (one per block in page 0) with expected block index
|
||
taken VERBATIM from PCSX2 blockTable4. (4) block-swizzle walk
|
||
via in-TB ref_block_idx4. (5) **bijectivity sweep over the
|
||
128×128 page** — 16384 NIBBLE slots (vs PSMT8's 8192 byte
|
||
slots), every pixel must hit a unique (byte_addr, nibble_hi)
|
||
pair and agree with both the in-TB ref byte address AND
|
||
ref nibble_hi. Plus multi-page sanity at FBW=4/bw_pg=2
|
||
(page-x crossing at x=192 → byte 10496 with blockTable4[1][2]=9,
|
||
and page-y crossing at y=128 → byte 16384) and non-page-aligned
|
||
FBP coverage at FBP ∈ {1,2,3}, including FBP=3+FBW=4+page-(1,1)
|
||
intra-block at (129, 129) → byte 30732 (= 6144 + 3*8192 + 0*256
|
||
+ ref_col_idx4(1,1)/2 = 30720 + 12). **First-attempt PASS**
|
||
errors=0. NOTE: This module is NOT YET wired into
|
||
`gs_pcrtc_stub` / `gif_image_xfer_stub` / `gs_stub` — those
|
||
still use linear PSMT4 addressing as of Ch137. The math is
|
||
locked here so follow-on chapters can wire `PSMT4_SWIZZLE`
|
||
parameter gates into the existing address paths without
|
||
disturbing the legacy linear-PSMT4 TBs (Ch103 / Ch106 / Ch107
|
||
/ Ch118). With Ch119 PSMCT32 + Ch125 PSMCT16 + Ch131 PSMT8 +
|
||
Ch137 PSMT4, **all four common GS PSMs now have byte-accurate-
|
||
to-real-PS2 swizzle math available as standalone primitives** —
|
||
the four-PSM swizzle math foundation is complete. Future
|
||
chapters can wire PSMT4 into pcrtc/image-xfer/raster behind a
|
||
PSMT4_SWIZZLE parameter (mirroring Ch120→Ch124 / Ch126→Ch130
|
||
/ Ch132→Ch136), with the existing nibble RMW machinery layered
|
||
on top.
|
||
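
The address composition behind the spot-check values above can be written out as a small model. This is a sketch, not the module's RTL: the function name is hypothetical, the block and nibble indices are passed in directly (values quoted or implied by the text) instead of being looked up in the PCSX2 tables, and `nibble_hi` is assumed to be the LSB of the within-block nibble index.

```python
def psmt4_compose(fbp, fbw, page_x, page_y, block_idx, nibble_idx):
    """Compose (byte_addr, nibble_hi) from page coords, block index, nibble index."""
    bw_pg = fbw >> 1                       # pages per row; FBW must be even
    page_index = page_y * bw_pg + page_x
    byte_addr = (fbp * 2048                # FBP in 2048-byte units
                 + page_index * 8192       # 8 KiB per page
                 + block_idx * 256         # 256 bytes per block
                 + (nibble_idx >> 1))      # two nibbles per byte
    return byte_addr, nibble_idx & 1       # assumed: nibble_hi = index LSB

# (129, 129) at FBP=3/FBW=4: page (1,1), block 0, nibble index 24
print(psmt4_compose(3, 4, 1, 1, 0, 24))     # (30732, 0)
# (127, 127) at FBP=0: page (0,0), block 31, last nibble 511
print(psmt4_compose(0, 2, 0, 0, 31, 511))   # (8191, 1)
# (192, 16) at FBW=4: page (1,0), blockTable4[1][2] = 9, block origin
print(psmt4_compose(0, 4, 1, 0, 9, 0))      # (10496, 0)
# (0, 128) at FBW=4: page (0,1), block 0
print(psmt4_compose(0, 4, 0, 1, 0, 0))      # (16384, 0)
```
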
- `tb_gs_swizzle_psmt8.sv` (Ch131) — focused contract for the new
|
||
`gs_swizzle_psmt8_stub` math primitive: a pure-comb module mapping
|
||
`(FBP, FBW, x, y)` to a VRAM byte address using the real PS2 GS
|
||
PSMT8 layout (8 KiB pages organized as 128×64 PSMT8 pixels — 2×
|
||
wider than CT16's 64×64 page; 32 blocks/page in a 4-rows × 8-cols
|
||
grid; each block 16×16 pixels = 256 bytes; **256-entry within-
|
||
block column table** — 2× the entries of CT16's 128-entry table
|
||
due to the doubled block area, indexed [yb][xb] with yb=0..15 +
|
||
xb=0..15 → byte 0..255). PSMT8 also introduces a new page-stride
|
||
constant `bw_pg = FBW >> 1` (PCSX2 asserts `(bw & 1) == 0` at
|
||
GSLocalMemory.h:553 because PSMT8 pages are 128 px wide vs FBW's
|
||
64-px units → 2 FBW units per PSMT8 page, so FBW must be even).
|
||
Source-table provenance pinned: `blockTable8` taken verbatim from
|
||
pcsx2/GS/GSTables.cpp lines 53–59; `columnTable8` from same file
|
||
lines 111–145. Master HEAD commit
|
||
`3000e113e2b3a76357c08dfa80d3c747f40e2706`; file blob SHA
|
||
`3581209b8217378f473f9de22a9dbc8c45ca49b6`. Cross-checked against
|
||
GSLocalMemory.h:551 `BlockNumber8` + the `pxOffset` template at
|
||
GSTables.cpp:247–258 (blockSize=256, pageSize=8192, pageWidth=128).
|
||
PCSX2's `bp` is in 256-byte block-pointer units; in our
|
||
FBP=2048-byte units, `bp = FBP * 8` so `bp*256 = FBP*2048`.
|
||
**Five-phase verification** (mirrors Ch125 PSMCT16 shape):
|
||
(1) **spot-checks** at 15 hand-computed corners (origin, intra-
|
||
block xb=1/4/8/yb=1, last byte of block (0,0), first/second block
|
||
origins, second row of blocks, third+fourth blocks, page-x at
|
||
x=128 and page-y at y=64, FBP=4 origin); (2a) **INDEPENDENT
|
||
column-table source lock** — 256 hard-coded `check()` calls (one
|
||
per (yb, xb) inside block (0,0)) where the expected byte index is
|
||
taken VERBATIM from PCSX2 columnTable8 with `<literal>` arithmetic,
|
||
NOT derived from the in-TB ref function. Catches any case where
|
||
DUT and ref share the same miscopy (the same trap Ch125 added
|
||
Phase 2a for with PSMCT16's column table); (2b) within-block
|
||
16×16 walk via the in-TB ref_col_idx8 (self-check); (3)
|
||
**INDEPENDENT block-table source lock** — 32 hard-coded checks
|
||
(one per block in page 0) with the expected block index taken
|
||
VERBATIM from PCSX2 blockTable8, NOT derived from the in-TB ref;
|
||
(4) block-swizzle walk via in-TB ref_block_idx8; (5)
|
||
**bijectivity sweep over the 128×64 page** — 8192 byte slots
|
||
(vs CT16's 4096 halfword slots), every pixel must hit a unique
|
||
byte address in `[0, 8192)` and agree with the in-TB reference.
|
||
Plus multi-page sanity at FBW=4/bw_pg=2 (page-x crossing at
|
||
x=192 and page-y crossing at y=64) and non-page-aligned FBP
|
||
coverage at FBP ∈ {1, 2, 3}, including FBP=3+FBW=4+page-(1,1)
|
||
intra-block crossing at (129, 65). **First-attempt PASS**
|
||
errors=0. NOTE: This module is NOT YET wired into
|
||
`gs_pcrtc_stub` / `gif_image_xfer_stub` / `gs_stub` — those
|
||
still use linear PSMT8 addressing as of Ch131. The math is
|
||
locked here so follow-on chapters can wire `PSMT8_SWIZZLE`
|
||
parameter gates into the existing address paths without
|
||
disturbing the legacy linear-PSMT8 TBs (Ch96 / Ch97 / Ch103 /
|
||
Ch105 / Ch107 / Ch117). With Ch119 PSMCT32 + Ch125 PSMCT16 +
|
||
Ch131 PSMT8, three of the four common GS PSMs now have byte-
|
||
accurate-to-real-PS2 swizzle math available as standalone
|
||
primitives; PSMT4 (with its 32×16 nibble intra-block layout) is
|
||
the natural next candidate (it landed as Ch137).
|
||
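
The page and block geometry figures quoted across the four swizzle entries can be cross-checked from pixel dimensions alone. The sketch below derives page bytes, block bytes, blocks per page, and column-table entry count for each PSM; the `Geom` tuple and `summary` helper are hypothetical names, and the dimensions are the ones stated in the entries.

```python
from collections import namedtuple

Geom = namedtuple("Geom", "bpp page_w page_h blk_w blk_h")

GEOM = {
    "PSMCT32": Geom(32, 64, 32, 8, 8),
    "PSMCT16": Geom(16, 64, 64, 16, 8),
    "PSMT8":   Geom(8, 128, 64, 16, 16),
    "PSMT4":   Geom(4, 128, 128, 32, 16),
}

def summary(g):
    """Return (page_bytes, block_bytes, blocks_per_page, column_table_entries)."""
    page_bytes = g.page_w * g.page_h * g.bpp // 8
    block_bytes = g.blk_w * g.blk_h * g.bpp // 8
    blocks = (g.page_w // g.blk_w) * (g.page_h // g.blk_h)
    entries = g.blk_w * g.blk_h            # one entry per pixel in a block
    return page_bytes, block_bytes, blocks, entries

for name, g in GEOM.items():
    print(name, summary(g))
# PSMCT32 (8192, 256, 32, 64)
# PSMCT16 (8192, 256, 32, 128)
# PSMT8 (8192, 256, 32, 256)
# PSMT4 (8192, 256, 32, 512)
```

Every format lands on 8 KiB pages of 32 blocks at 256 bytes each; only the pixel dimensions and the column-table size change.
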
- `tb_gs_swizzle_psmct16.sv` (Ch125) — focused contract for the
|
||
new `gs_swizzle_psmct16_stub` math primitive: a pure-comb module
|
||
mapping `(FBP, FBW, x, y)` to a VRAM byte address using the real
|
||
PS2 GS PSMCT16 layout (8 KiB pages organized as 64×64 PSMCT16
|
||
pixels; 32 blocks/page in a 4×8 grid; each block 16×8 pixels =
|
||
256 bytes; **non-trivial within-block column table** — unlike
|
||
PSMCT32 where within-block IS row-major words by accident,
|
||
PSMCT16 has 4 internal 16×2-pixel sub-columns with a 128-entry
|
||
permutation). Source-table provenance pinned: `blockTable16`
|
||
taken verbatim from pcsx2/GS/GSTables.cpp lines 29–39
|
||
(master HEAD commit 3d71e310; file-touch commit d983b2b0,
|
||
2026-01-12); `columnTable16` from same file lines 91–109.
|
||
Cross-check against the older Debian-packaged GSdx
|
||
`PixelAddressOrg16(x, y, bp, bw) = (BlockNumber16(...) << 7) +
|
||
columnTable16[y & 7][x & 15]` confirms the address chain
|
||
(`<< 7` lifts to halfword units, multiply by 2 for bytes; in
|
||
our FBP=2048-byte units, bp = FBP * 8 so bp*256 = FBP*2048).
|
||
**Five-phase verification**: (1) spot-checks at 13 well-defined
|
||
corners (origin, intra-block, first/second block, second row of
|
||
blocks, page-x and page-y boundaries, FBP=4 origin); (2)
|
||
within-block 16×8 walk asserting `byte = 2 * columnTable16[yb][xb]`
|
||
— locks the column table; a row-major-halfwords regression would
|
||
fail; (3) **source-table lock** — 32 hard-coded address checks
|
||
(one per block in page 0) with the expected block index taken
|
||
VERBATIM from PCSX2 blockTable16, NOT derived from the in-TB
|
||
reference function; (4) block-swizzle walk cross-checking the
|
||
in-TB ref function against the DUT (the bijectivity sweep
|
||
relies on it being correct); (5) **bijectivity sweep over the
|
||
64×64 page** — 4096 halfword slots, every pixel must hit a
|
||
unique halfword address in `[0, 8192)` and agree with the in-TB
|
||
reference. Plus multi-page sanity at FBW=2 and non-page-aligned
|
||
FBP coverage at FBP ∈ {1, 2, 3} (real PS2 supports any
|
||
2048-byte-aligned FBP — same broadening Ch119 adopted post-
|
||
audit). NOTE: This module is NOT YET wired into `gs_pcrtc_stub`
|
||
/ `gif_image_xfer_stub` / `gs_stub` — those still use linear
|
||
PSMCT16 addressing as of Ch125. The math is locked here so
|
||
follow-on chapters can wire `PSMCT16_SWIZZLE` parameter gates
|
||
into the existing address paths without disturbing the legacy
|
||
linear-PSMCT16 TBs (Ch94 / Ch95 / Ch103 / Ch116).
|
||
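
The unit chain in the GSdx cross-check (block number shifted into halfword units, then doubled for bytes) can be verified with a short model. The function names are hypothetical; the block number is passed in directly, with `bp` already folded in as in the quoted `PixelAddressOrg16` expression.

```python
def pixel_halfword(block_number, col_idx):
    """GSdx-style chain: (block number << 7) + column-table entry, in halfwords."""
    return (block_number << 7) + col_idx

def pixel_byte(block_number, col_idx):
    """Halfword index doubled to a byte address."""
    return 2 * pixel_halfword(block_number, col_idx)

def bp_from_fbp(fbp):
    """Convert FBP in 2048-byte units to bp in 256-byte block units."""
    return fbp * 8

print(pixel_byte(0, 4))                  # 8    (columnTable16[1][0] = 4)
print(pixel_byte(bp_from_fbp(4), 0))     # 8192 (FBP=4 origin = 4 * 2048)

# The chain is equivalent to block*256 + 2*col for any inputs.
print(all(pixel_byte(b, c) == b * 256 + 2 * c
          for b in range(64) for c in range(128)))   # True
```
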
- `tb_gs_swizzle_psmct32.sv` (Ch119) — focused contract for the
|
||
new `gs_swizzle_psmct32_stub` math primitive: a pure-combinational
|
||
module mapping `(FBP, FBW, x, y)` to a VRAM byte address using
|
||
the real PS2 GS PSMCT32 page/block swizzle layout (8 KiB pages,
|
||
4×8 grid of 8×8-pixel blocks per page, blocks ordered per the
|
||
canonical PCSX2 PSMCT32 swizzle table, row-major within a block).
|
||
Verification has five phases: (1) spot-checks on the well-defined
|
||
corners — origin, intra-block walks, first/second block, second
|
||
row of blocks, page-x and page-y boundaries, second page on x,
|
||
and FBP=4 origin; (2) within-block 8×8 walk asserting
|
||
`byte_in_block = yb*32 + xb*4`; (3) **source-table lock** — 32
|
||
hard-coded address checks (one per block in page 0) where the
|
||
expected block index is taken VERBATIM from PCSX2's PSMCT32 block
|
||
table, NOT derived from the in-TB reference function. This proves
|
||
the DUT's `swizzle_psmct32()` table matches the canonical source;
|
||
a copied-wrong table that happened to still be a valid permutation
|
||
of 0..31 would fail this phase, while the bijectivity sweep below
|
||
would pass it; (4) block-swizzle walk (redundant with phase 3,
|
||
cross-checks ref_block_idx against the DUT — the bijectivity
|
||
sweep relies on ref_block_idx being correct); (5) bijectivity
|
||
sweep over the full 64×32 PSMCT32 page — every word slot in
|
||
`[0, 8192)` reached exactly once (catches any swap/typo in the
|
||
swizzle table). Plus a multi-page sanity check at FBW=2 (pixel
|
||
(96, 16) → block (4,2) of page 1 → addr 14336) and a **non-page-
|
||
aligned FBP** phase that drives FBP=1, 2, 3 (mid-page in the 8 KiB
|
||
sense — real PS2 supports any 2048-byte-aligned FBP; our address
|
||
formula is bit-correct for non-page-aligned FBP) plus FBP=3 with
|
||
FBW=2 + intra-block crossing as a stress case. NOTE (now
|
||
historical): at Ch119 this module was standalone math only;
|
||
Ch120 (PCRTC read), Ch121 (image-xfer write), and Ch122
|
||
(raster write) wired it into the three integration points —
|
||
the same shape that Ch125–Ch128 (PSMCT16), Ch131–Ch134
|
||
(PSMT8), and Ch137–Ch140 (PSMT4) followed for the other
|
||
three PSMs.
|
||
- `tb_gs_image_xfer_psmt4.sv` (Ch118) — focused contract for
|
||
`gif_image_xfer_stub`'s PSMT4 path (the fourth and final
|
||
supported PSM). PSMT4 packs 0.5 bytes/pixel (4-bit nibble per
|
||
pixel = 2 px/byte), so each 128-bit IMAGE qword carries 32
|
||
pixels in 16 bytes. Each emit is a SUB-BYTE write: `write_be
|
||
= 4'b0001` with a per-emit nibble mask
|
||
(`write_mask = 0x0000_000F` for the LOW nibble,
|
||
`0x0000_00F0` for the HIGH nibble), keyed by `(DSAX+x)[0]`;
|
||
vram_stub's per-bit merge commits exactly the targeted
|
||
nibble, preserving the OTHER nibble of the byte.
|
||
Back-to-back emits to the same byte (e.g. x=0 + x=1 of the
|
||
same row) chain through NBA semantics without bypass logic
|
||
— the same trick the raster channel uses since Ch106. The TB
|
||
is INTENTIONALLY adversarial: VRAM is preloaded with `0xA5`
|
||
across every byte the engine will write (plus boundary
|
||
bytes), then a single IMAGE qword (32 PSMT4 pixels) covers
|
||
the entire 8×4 rect. Every byte ends as
|
||
`{nibble_high_pixel, nibble_low_pixel}` (no trace of 0xA5);
|
||
bytes immediately right of the rect on each row stay 0xA5
|
||
(proves no nibble leak past RRW); bytes before / after the
|
||
destination region also stay 0xA5. Pattern
|
||
`pixel(x,y) = 4'((y*8+x) & 0xF)`. Asserts: 1 trxdir arm, 32
|
||
vram writes, every emit `be=0001` and `mask ∈ {0x0F, 0xF0}`,
|
||
per-byte readback matches, boundary bytes preserved.
|
||
- `tb_gs_image_xfer_psmt8.sv` (Ch117) — focused contract for
|
||
`gif_image_xfer_stub`'s PSMT8 path. Pushes 2 IMAGE qwords
|
||
(32 PSMT8 pixels = 16 px/qword × 2) through the engine after
|
||
a TRXDIR-shaped GIF-A+D register sequence with DPSM=PSMT8
|
||
(=0x13). PSMT8 packs 1 byte/pixel (an 8-bit CLUT index), so
|
||
each qword holds 16 pixels; the engine emits one 8-bit pixel
|
||
per cycle with `write_be = 4'b0001`, the index in the LOW
|
||
byte of `write_data`, and `write_mask = 0xFFFFFFFF`;
|
||
vram_stub commits `mem[write_addr] <= write_data[7:0]` at
|
||
any byte alignment. Pattern is `pixel(x,y) = 8'(y*16 + x)` —
|
||
32 distinct values across the 8×4 rect so a wrong-byte-lane
|
||
commit shows up unambiguously. Asserts: 1 trxdir arm, 32
|
||
vram writes (all `be=0001`, `mask=0xFFFFFFFF`), every pixel
|
||
reads back at `dest_base + y*64 + x`, plus right-of-rect /
|
||
before / after byte-zero boundary preservation. Each qword
|
||
packs TWO rows of 8 pixels (lanes 0..7 = row y, lanes 8..15
|
||
= row y+1) — exercises the per-lane row-stride math at the
|
||
qword boundary.
|
||
- `tb_gs_image_xfer_psmct16.sv` (Ch116) — focused contract for
|
||
`gif_image_xfer_stub`'s new PSMCT16 path. Pushes 4 IMAGE
|
||
qwords (32 PSMCT16 pixels = 8 px/qword × 4) through the
|
||
engine after a TRXDIR-shaped GIF-A+D register sequence
|
||
(BITBLTBUF/TRXPOS/TRXREG/TRXDIR). PSMCT16 packs 2 bytes/pixel,
|
||
so each qword holds 8 pixels (vs 4 for PSMCT32). The engine
|
||
emits one 16-bit pixel per cycle to vram_stub with
|
||
`write_be = 4'b0011`, the pixel value in the LOW halfword of
|
||
`write_data`, and `write_mask = 0xFFFFFFFF`; vram_stub commits
|
||
the 2 bytes at the 2-byte-aligned destination address. Pattern
|
||
is `pixel(x,y) = 16'h{yyxx}{yyxx}` — distinct per-pixel value
|
||
so a wrong-lane commit shows up unambiguously. Asserts:
|
||
1 trxdir arm, 32 vram writes (all `be=0011`, `mask=0xFFFFFFFF`),
|
||
every pixel reads back at `dest_base + y*row_stride + x*2`,
|
||
and the bytes immediately right of the rect on each row +
|
||
before the dest region + after the dest region all stay zero
|
||
(proves row-stride math + no halfword leak past RRW). PSMT8
|
||
image-xfer landed in Ch117 and PSMT4 image-xfer landed in
|
||
Ch118 — see those TB rows for their own per-byte / per-nibble
|
||
contract coverage.
|
||
- `tb_gs_demo_psmt4_e2e_trxdir.sv` (Ch110) — driver-shaped
|
||
PSMT4 demo with the palette upload now arriving via a real
|
||
TRXDIR/TRXPOS/TRXREG/HWREG image-transfer GIF packet sequence
|
||
instead of TB-direct vram_stub writes. Closes the LAST
|
||
TB-direct path in the e2e demo flow: every byte the GS sees —
|
||
framebuffer pixels AND palette source — now arrives through a
|
||
driver-shaped GIF stream. The DMAC delivers 36 qwords total:
|
||
U1 (PACKED, NREG=4): BITBLTBUF/TRXPOS/TRXREG/TRXDIR — TRXDIR
|
||
arms `gif_image_xfer_stub`. U2 (IMAGE, NLOOP=4): 4 qwords of 4
|
||
PSMCT32 entries each → 16 palette entries written into VRAM at
|
||
DBP*256 by `gif_image_xfer_stub`. Then 4 SPRITE PACKED packets
|
||
+ 1 TEX0_1 PACKED packet. PASS criteria add to Ch109's:
|
||
**1 EV_DMA_START / 36 EV_DMA_BEAT / 1 EV_DMA_DONE**, **7
|
||
GIFtag accepts** (U1 + U2 + 4×SPRITE + TEX0), **25 PACKED A+D
|
||
dispatches** (4 TRX-setup + 20 SPRITE + 1 TEX0), **16
|
||
image-xfer VRAM writes** from `gif_image_xfer_stub` (DBP=4,
|
||
DBW=1, DPSM=PSMCT32, DSAX=DSAY=0, RRW=16, RRH=1). The vram_stub
|
||
write port is muxed at TB level: `xfer_busy ? xfer_we :
|
||
raster_pixel_emit` (sequenced — palette upload completes before
|
||
sprites raster). Ch110 also added a backpressure path on
|
||
`gif_packed_stub` (`image_data_ready` input) so the upstream
|
||
DMA stalls while `gif_image_xfer_stub` is draining the previous
|
||
IMAGE qword's 4 PSMCT32 lanes; outside S_IMAGE the gate is a
|
||
no-op (in_ready stays high). Privileged-block MMIO (PMODE/
|
||
DISPFB1/DISPLAY1) remains TB-direct because those are CPU MMIO
|
||
writes in real hardware, not GIF traffic.
|
||
- `tb_gs_demo_psmt4_e2e_dmac.sv` (Ch109) — same 4-quadrant
|
||
PSMT4 demo as Ch108, but the GIFtag + PACKED A+D quadwords
|
||
arrive at `gif_packed_stub` via the DMAC channel-2 →
|
||
`ee_memory_map_stub` → `ee_ram_stub` path instead of being
|
||
TB-driven directly. Closes the last GIF-side sideband from
|
||
Ch108: the demo is now reachable the way real EE/IOP code
|
||
reaches it. The TB pre-stages the same 26 qwords (4 SPRITE
|
||
packets × 6 qwords + 1 TEX0_1 packet × 2 qwords) into RAM at
|
||
PAYLOAD_MADR, then writes DMAC channel-2 MADR/QWC/CHCR; a
|
||
single NORMAL transfer with QWC=26 streams them into the GIF.
|
||
PASS criteria add to Ch108's: **1 EV_DMA_START / 26
|
||
EV_DMA_BEAT / 1 EV_DMA_DONE** (DMA event taxonomy locked),
|
||
with the same downstream chain — 5 GIFtag accepts, 21 A+D
|
||
dispatches in the expected reg-num sequence, 32 PSMT4 emits,
|
||
1 loader_busy rise, identical 16×8 captured frame. Privileged-
|
||
block MMIO and palette pre-stage stay TB-direct (NOT GIF-side);
|
||
TRXDIR/HWREG image-transfer for palette upload is a separate
|
||
future chapter.
|
||
- `tb_gs_demo_psmt4_e2e_packed.sv` (Ch108) — same 4-quadrant
|
||
PSMT4 demo as Ch107 but routed through the GIFtag / PACKED
|
||
A+D front-end (`gif_packed_stub` with REAL_AD_REG_MAP=1).
|
||
Closes the last bit of GS-side sideband from Ch107: instead
|
||
of TB-driving `gs_stub.gif_reg_*` directly, the TB pushes raw
|
||
128-bit GIFtag + PACKED A+D quadwords into `gif_packed_stub.
|
||
in_*` exactly the way the real GIF would receive them from
|
||
PATH3. Each SPRITE is a packet of 1 GIFtag (NLOOP=1, NREG=5,
|
||
PACKED, REGS=0xEEEEE — 5×A+D in the low 5 nibble slots) +
|
||
5 PACKED A+D qwords (PRIM, FRAME_1=PSMT4, RGBAQ, XYZ2, XYZ2);
|
||
TEX0_1 load is its own 1-tag/1-A+D packet. Total: 5 GIFtag
|
||
accepts (4 SPRITEs + 1 TEX0_1) and 4×5 + 1×1 = 21 PACKED A+D
|
||
register-write dispatches into gs_stub.gif_reg_*. 32 PSMT4
|
||
raster emits arrive (Ch106 RMW), loader fires exactly once
|
||
on TEX0_1, and the captured 16×8 frame matches the same
|
||
expected CLUT-decoded RGB as Ch107 — i.e. real-format GIF
|
||
packets reach the GS register file with the same cadence the
|
||
TB previously synthesised by hand. Privileged-block MMIO
|
||
(PMODE/DISPFB1/DISPLAY1) and the palette pre-stage in VRAM
|
||
remain TB-direct because they are NOT GIF-side; the palette
|
||
upload via real-PS2 TRXDIR/TRXPOS/TRXREG/HWREG image-transfer
|
||
packets is a separate future chapter, as is the DMAC channel-2
|
||
burst that would normally deliver the GIFtag qwords (this TB
|
||
drives `gif_packed_stub.in_*` directly to keep the demo
|
||
narrow and deterministic; the full DMAC→RAM→GIF round trip
|
||
is what the integration-tier `tb_ee_core_gif_*` family
|
||
covers).
|
||
- `tb_gs_psmt4_round_trip.sv` (Ch104) — full driver-shaped
|
||
PSMT4 + CLD=4 + CSA round trip. Wires `gs_stub` +
|
||
`vram_stub` + `clut_stub` + `clut_loader_stub` + `gs_pcrtc_stub`
|
||
end-to-end with `pcrtc.clut_csa = gs_stub.tex0_1_csa_q` (the
|
||
Ch98 sideband-free pattern). Phase 1: stages a 4×4 PSMT4 sprite
|
||
in VRAM, plus a 16-entry pattern_a palette in VRAM at
|
||
`CBP_A*256`. Drives TEX0_1 with `CBP=4, CPSM=PSMCT32, CSM=CSM2,
|
||
CSA=0, CLD=4`; the loader writes pattern_a into `clut_stub[0..15]`
|
||
and `pcrtc.clut_csa` is 0, so PSMT4 scanout reads pattern_a per
|
||
nibble. Phase 2: stages a different pattern_b palette at
|
||
`CBP_B*256` and drives TEX0_1 with `CBP=8, CSA=4, CLD=4`; the
|
||
loader writes pattern_b into `clut_stub[64..79]` (the CSA=4
|
||
window) and `pcrtc.clut_csa` flips to 4, so the same VRAM
|
||
sprite — same DISPFB1 / DISPLAY1 / PMODE — now reads pattern_b.
|
||
Proves loader policy + clut_stub contents + PCRTC lookup are
|
||
wired consistently.
|
||
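The PSMCT32 address math that `tb_gs_swizzle_psmct32.sv` locks can be sketched as a small Python reference model. This is an illustrative sketch under stated assumptions, not the project's RTL: the helper name `psmct32_addr` is hypothetical, the block table is the commonly published PCSX2 PSMCT32 block order, and the `FBP*2048` byte base and row-major in-block layout follow the description in the Ch119 row above.

```python
# Block order inside one 8 KiB PSMCT32 page: 4 rows x 8 columns of 8x8-pixel blocks.
BLOCK_TABLE_32 = [
    [0, 1, 4, 5, 16, 17, 20, 21],
    [2, 3, 6, 7, 18, 19, 22, 23],
    [8, 9, 12, 13, 24, 25, 28, 29],
    [10, 11, 14, 15, 26, 27, 30, 31],
]

def psmct32_addr(fbp, fbw, x, y):
    """Byte address of pixel (x, y); hypothetical helper mirroring the text."""
    page = (y // 32) * fbw + (x // 64)       # pages are 64x32 pixels, FBW pages per row
    block = BLOCK_TABLE_32[(y % 32) // 8][(x % 64) // 8]
    xb, yb = x % 8, y % 8                    # row-major inside the block
    return fbp * 2048 + page * 8192 + block * 256 + yb * 32 + xb * 4

# Bijectivity sweep over one page: every pixel should land on its own word slot.
page0 = {psmct32_addr(0, 1, x, y) for y in range(32) for x in range(64)}
```

With FBW=2, pixel (96, 16) falls in page 1, block column 4, block row 2, which the table maps to block 24, giving `8192 + 24*256 = 14336` as in the multi-page check.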
|
||
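The qword, GIFtag, and A+D counts quoted for Ch109 and Ch110 follow directly from the packet shapes: a PACKED packet is one GIFtag plus `NLOOP * NREG` data qwords, and an IMAGE packet is one GIFtag plus `NLOOP` raw qwords. The following sketch re-derives those totals; the helper names are hypothetical and exist only for this illustration.

```python
def packed(nreg, nloop=1):
    """One PACKED packet as (giftags, A+D qwords, image qwords)."""
    return (1, nloop * nreg, 0)

def image(nloop):
    """One IMAGE packet: a GIFtag followed by NLOOP raw data qwords."""
    return (1, 0, nloop)

def totals(packets):
    """Sum the per-packet counts into the figures the TBs assert on."""
    tags = sum(p[0] for p in packets)
    ad = sum(p[1] for p in packets)
    img = sum(p[2] for p in packets)
    return {"qwords": tags + ad + img, "giftags": tags, "ad": ad, "image": img}

sprites = [packed(nreg=5)] * 4           # PRIM, FRAME_1, RGBAQ, XYZ2, XYZ2
tex0 = [packed(nreg=1)]                  # TEX0_1 load
ch109 = totals(sprites + tex0)
ch110 = totals([packed(nreg=4), image(nloop=4)] + sprites + tex0)
```

Here `ch109` reproduces the 26-qword / 5-tag / 21-dispatch figures and `ch110` the 36-qword / 7-tag / 25-dispatch figures.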
Scope (current, after Ch165):
|
||
|
||
- **PSMCT32 (DISPFB1.PSM=0), PSMCT16 (PSM=2), PSMT8 (PSM=0x13),
|
||
and PSMT4 (PSM=0x14) honored at BOTH the read and write
|
||
sides** (Ch94 + Ch95 + Ch96 + Ch97 + Ch103 + Ch105 + Ch106).
|
||
PSMCT24/PSMCT16S/PSMZ32/etc. force scanout off and are not
|
||
contract-tested at the raster channel. The write side
|
||
(gs_stub.raster_pixel_emit) emits the four supported PSMs via
|
||
`raster_pixel_be_q` (per-byte gate) and `raster_pixel_mask_q`
|
||
(per-bit merge mask, Ch106): PSMCT32 = be `0xF` / mask
|
||
`0xFFFFFFFF`, PSMCT16 = be `0x3` / mask `0xFFFFFFFF`, PSMT8 =
|
||
be `0x1` / mask `0xFFFFFFFF`, PSMT4 = be `0x1` / mask `0x0F`
|
||
or `0xF0`. The mask path is no-op for byte-or-larger PSMs
|
||
(mem[i] = data[i] when mask_i = 0xFF) and only meaningful for
|
||
PSMT4 sub-byte writes. PSMT8 / PSMT4
|
||
scanout surfaces the index/nibble as grayscale by default;
|
||
with `clut_enable=1` (Ch97/Ch103) and a programmed
|
||
`clut_stub`, the index/nibble looks up real RGB. CLUT contents come either from a TB-direct write OR
|
||
(Ch99..Ch102) from a VRAM→CLUT load triggered by a TEX0_1 GIF
|
||
write with `CSM == 1` (CSM2 linear), `CPSM` ∈ {PSMCT32,
|
||
PSMCT16}, and a CLD value passing the policy: CLD=0 never;
|
||
CLD=1 always (full 256-entry load); CLD=2 if CBP changed since
|
||
last load (full); CLD=3 if CBP/CPSM/CSA any-changed (full);
|
||
CLD=4 always but only the 16-entry CSA window at indices
|
||
`CSA*16 + i` (Ch102 — preserves the other 240 entries);
|
||
CLD ∈ {5..7} silently no-op (reserved). `clut_loader_stub`
|
||
walks the entries via `vram_stub`'s second read port; PSMCT16
|
||
entries are unpacked with the same 5→8 bit-replicate the
|
||
scanout side uses (Ch94). CSM1 swizzle and CPSM ∉ {PSMCT32,
|
||
PSMCT16} remain deferred.
|
||
- **Single CRTC, single DISPFB**. The real PS2 PCRTC has two interlace-capable
read circuits (DISPFB1, DISPFB2). One context is enough for
|
||
TBs to verify the round trip; PMODE.EN2 + DISPFB2 + DISPLAY2
|
||
is deferred.
|
||
- **Read-side addressing**. Linear by default (legacy formula
|
||
`vram_read_addr = FBP*2048 + (effective_y*FBW*64 + effective_x)
|
||
<< bpp_shift`). Four OPTIONAL per-PSM swizzle paths gated by
|
||
parameters on `gs_pcrtc_stub`: `PSMCT32_SWIZZLE=1` (Ch120)
|
||
routes PSMCT32 reads through `gs_swizzle_psmct32_stub`;
|
||
`PSMCT16_SWIZZLE=1` (Ch126) routes PSMCT16 reads through
|
||
`gs_swizzle_psmct16_stub`; `PSMT8_SWIZZLE=1` (Ch132) routes
|
||
PSMT8 reads through `gs_swizzle_psmt8_stub` (Ch131) — FBW must
|
||
be even because PSMT8 pages are 128 px wide and the swizzle
|
||
internally divides FBW by 2; `PSMT4_SWIZZLE=1` (Ch138) routes
|
||
PSMT4 reads through `gs_swizzle_psmt4_stub` (Ch137); FBW must
|
||
be even (same as PSMT8). The four parameters are independent —
|
||
enabling one doesn't affect the others. PSMT4's swizzle module
|
||
also outputs a `nibble_hi` selector that PCRTC uses in place of
|
||
`pixel_index[0]` to pick which nibble of the byte at the
|
||
swizzled address holds this pixel (PSMT4 packs 2 pixels per
|
||
byte and the canonical PCSX2 column table reorders nibbles
|
||
within a block, so the linear formula's `pixel_index[0]`
|
||
selector is no longer correct under the swizzled layout). All
|
||
four swizzle parameter defaults are 0 so all existing PCRTC-
|
||
using TBs see the legacy linear behavior unchanged. The
|
||
PSMT4 image-xfer (Ch139) and raster (Ch140) write-side
|
||
wiring is now live as well, completing the four-PSM × three-
|
||
path swizzle integration. Both driver-shape e2e demos for
|
||
PSMT4 are also live: raster-driven (Ch141) and TRXDIR-driven
|
||
(Ch142). All four common GS PSMs now have BOTH driver-shape
|
||
e2e demos (CT32 Ch123+Ch124, CT16 Ch129+Ch130, T8 Ch135+
|
||
Ch136, T4 Ch141+Ch142) — closing the four-PSM × three-path
|
||
× dual-driver-shape e2e foundation.
|
||
- **Parallel to `platform_video_stub`, not a replacement**. We
|
||
did not extend `platform_video_stub` (which would have
|
||
rippled through 6 existing TBs). `gs_pcrtc_stub` is the alternative
|
||
video source for TBs that want VRAM-backed scanout. The legacy
|
||
flood-fill module stays as-is.
|
||
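The CLD load policy listed in the first scope bullet can be written as a small decision function. This is a sketch of the policy as stated in this document, not of the RTL in `clut_loader_stub`: the function name and the shape of the `last` record are assumptions, and the `CSM == 1` / `CPSM` gates are deliberately left out to keep it short.

```python
def clut_load_plan(cld, cbp, cpsm, csa, last):
    """Return the CLUT indices to reload for one TEX0_1 write.

    `last` holds the cbp/cpsm/csa of the previous load (hypothetical shape).
    """
    full = list(range(256))
    if cld == 1:
        return full                                   # always, full load
    if cld == 2:
        return full if cbp != last["cbp"] else []     # only when CBP changed
    if cld == 3:
        changed = (cbp, cpsm, csa) != (last["cbp"], last["cpsm"], last["csa"])
        return full if changed else []
    if cld == 4:
        return [csa * 16 + i for i in range(16)]      # 16-entry CSA window only
    return []  # CLD=0 never loads; CLD 5..7 are treated as reserved no-ops
```

For example, `CLD=4` with `CSA=4` touches only indices 64..79, which is the window the Ch104 round-trip TB checks.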
|
||
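The write-side combination of a per-byte enable and a per-bit merge mask can be modelled in a few lines. The sketch below is a behavioral approximation of the merge described in the first scope bullet, with hypothetical helper names; it assumes the PSMT4 nibble is already positioned in the matching half of the low data byte, which the text does not state explicitly.

```python
def merge_write(mem, addr, data, be, mask):
    """Apply one 32-bit write with a per-byte enable and a per-bit merge mask."""
    for lane in range(4):
        if (be >> lane) & 1:
            new = (data >> (8 * lane)) & 0xFF
            m = (mask >> (8 * lane)) & 0xFF
            # Keep the bits outside the mask, take the bits inside it from the new data.
            mem[addr + lane] = (mem[addr + lane] & ~m & 0xFF) | (new & m)

def psmt4_write(mem, addr, nibble, nibble_hi):
    """PSMT4 emit: be=0b0001, mask selects the low or high nibble."""
    shift = 4 if nibble_hi else 0
    merge_write(mem, addr, (nibble & 0xF) << shift, 0b0001, 0xF << shift)

vram = bytearray([0xA5] * 4)
psmt4_write(vram, 0, 0x3, nibble_hi=False)          # low nibble only
psmt4_write(vram, 0, 0xC, nibble_hi=True)           # then the high nibble
merge_write(vram, 2, 0x1234, 0b0011, 0xFFFFFFFF)    # PSMCT16-style halfword write
```

After these three writes byte 0 holds both new nibbles, byte 1 still holds the `0xA5` preload, and bytes 2..3 hold the halfword. This is the same shape of check the adversarial PSMT4 TB performs.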
### End-to-end demo manifest (Ch143)
|
||
|
||
Eight driver-shaped end-to-end byte-accurate demos cover the
|
||
four common GS PSMs across both driver shapes (raster-driven
|
||
PACKED-SPRITE payload + TRXDIR-driven IMAGE payload). Each demo
|
||
runs the same EE-bootlet → DMAC → GIF → GS → vram → swizzled-
|
||
PCRTC chain with all three same-PSM swizzle gates parameter-set
|
||
to 1; the listed write-side path is load-bearing and the other
|
||
write-side path is asserted dormant in the demo flow.
|
||
|
||
All eight demos emit a 16×8 framebuffer (128 pixels). The raster
|
||
column shows `(emits, xfer_writes)`; the TRXDIR column shows
|
||
`(xfer_writes, emits)` — in both cases the load-bearing path
|
||
fires 128 times and the dormant path is asserted 0.
|
||
|
||
| PSM | Raster-driven e2e | TRXDIR-driven e2e |
|
||
|---------|---------------------------------|------------------------------------|
|
||
| PSMCT32 | Ch123 — `tb_gs_demo_psmct32_swizzle_e2e` (128, 0) | Ch124 — `tb_gs_demo_psmct32_swizzle_trxdir_e2e` (128, 0) |
|
||
| PSMCT16 | Ch129 — `tb_gs_demo_psmct16_swizzle_e2e` (128, 0) | Ch130 — `tb_gs_demo_psmct16_swizzle_trxdir_e2e` (128, 0) |
|
||
| PSMT8 | Ch135 — `tb_gs_demo_psmt8_swizzle_e2e` (128, 0) | Ch136 — `tb_gs_demo_psmt8_swizzle_trxdir_e2e` (128, 0) |
|
||
| PSMT4 | Ch141 — `tb_gs_demo_psmt4_swizzle_e2e` (128, 0) | Ch142 — `tb_gs_demo_psmt4_swizzle_trxdir_e2e` (128, 0) |
|
||
|
||
For each row both demos use the same per-quadrant pixel pattern
|
||
(so the verify side is shared across the row), the same DBW-
|
||
even constraint where applicable (PSMT8 / PSMT4: 128-px-wide
|
||
pages, so DBW must be even with a minimum of 2), and verification through the
|
||
freed-up `vram_stub` 2nd read port. Ch141 + Ch142 together
|
||
close the four-PSM × three-path × dual-driver-shape e2e
|
||
foundation — the foundation Ch143 manifests and seals.
|
||
|
||
**Hardware-demo candidates**:
|
||
|
||
- **PSMCT32 swizzled raster e2e (Ch123)** — simplest direct-
|
||
color path: 4 SPRITE PACKED packets, RGBAQ.{R,G,B,A} mapped
|
||
1:1 to scanout RGB, no CLUT, no nibble RMW. The natural first
|
||
hardware demo because every byte from EE-bootlet through the
|
||
swizzled 16×8 framebuffer to PCRTC RGB is visible without
|
||
any indirection. Build target: `make tb_gs_demo_psmct32_swizzle_e2e`.
|
||
- **PSMT4 swizzled TRXDIR e2e (Ch142)** — strongest indexed/
|
||
CLUT-like stress path: U1 PACKED A+D TRX setup + U2 IMAGE
|
||
NLOOP=4 with 32 PSMT4 nibbles per qword, image-xfer engine
|
||
decoding the canonical PCSX2 columnTable4 (which reorders
|
||
nibbles within a block — the linear `pixel_index[0]` rule is
|
||
wrong under swizzle), and per-pixel nibble RMW on vram_stub
|
||
via `write_be=4'b0001 + write_mask ∈ {0x0F, 0xF0}` keyed by
|
||
the swizzle's `nibble_hi`. Exercises the full sub-byte
|
||
pipeline + the canonical-source-locked column table. Build
|
||
target: `make tb_gs_demo_psmt4_swizzle_trxdir_e2e`.
|
||
|
||
### First hardware-targeted top wrapper (Ch146)
|
||
|
||
Ch146 turns the Ch144 readiness audit + Ch145 BRAM-shrink groundwork
|
||
into a real top-level SystemVerilog module: [`rtl/top/top_psmct32_raster_demo.sv`](../../rtl/top/top_psmct32_raster_demo.sv).
|
||
This is the module a board-level synthesis project would target
|
||
first. Board-level concerns (HDMI/VGA PHY, pin constraints, .mem
|
||
bake tooling, clock-domain crossings) are deliberately deferred —
|
||
Ch146 proves the design can be expressed as a single hardware-
|
||
shape module.
|
||
|
||
**Top ports**:
|
||
- `clk` / `rst_n` / `core_go` — clock, active-low synchronous reset,
|
||
start pulse (a board reset-release sequencer can tie `core_go`
|
||
high after `rst_n` deasserts).
|
||
- `r/g/b/hsync/vsync/de` — 8-bit RGB scanout from PCRTC.
|
||
- `core_halt` / `dma_done_seen` / `frame_seen` — debug/status bundle
|
||
suitable for LEDs or a board-level state observer.
|
||
|
||
**Top parameters**: `H_ACTIVE` (default 16), `V_ACTIVE` (default 8),
|
||
`BIOS_SIZE_BYTES`, `RAM_SIZE_BYTES`, `VRAM_BYTES`,
|
||
`USEG_SHADOW_WORDS_PARAM` (default 1024 = 4 KiB per Ch145).
|
||
|
||
**Image fixtures** are passed via macros (iverilog-12 string-
|
||
parameter forwarding limitation):
|
||
`TOP_PSMCT32_RASTER_DEMO_BIOS_IMAGE_FILE` and
|
||
`TOP_PSMCT32_RASTER_DEMO_PAYLOAD_IMAGE_FILE`. The fixtures are
|
||
baked by [`sim/data/top_psmct32_raster_demo/bake.py`](../../sim/data/top_psmct32_raster_demo/bake.py)
|
||
which writes:
|
||
- `bios.mem` — 18-word EE bootlet (one 32-bit hex word per line)
|
||
- `payload.mem` — 40 qwords for ee_ram_stub (16 zero qwords +
|
||
24 GIF qwords carrying 4 SPRITE PACKED packets)
|
||
|
||
The bake script is a deterministic Python rewrite of the
|
||
procedural `ee_prog_word()` + `preload_qword()` loops in the
|
||
Ch123 TB. Same bit-exact values, just baked into static repo
|
||
artifacts so a hardware top can `$readmemh` them.
|
||
|
||
**Focused TB**: [`sim/tb/top/tb_top_psmct32_raster_demo.sv`](../../sim/tb/top/tb_top_psmct32_raster_demo.sv).
|
||
Drives the top with the static fixtures, captures one full
|
||
PCRTC frame after the EE halts and DMAC completes, and asserts
|
||
the per-quadrant RGB matches the Ch123 frame exactly. Counts:
|
||
`raster_emits=128, errors=0, core_halt=1, dma_done_seen=1,
|
||
frame_seen=1`.
|
||
|
||
**Bug-fix iteration**: the first bake had Y in XYZ2 placed at
|
||
bits[43:32] instead of bits[31:20] — a Python translation error
|
||
of the SystemVerilog `{32'd0, y_int, 4'd0, x_int, 4'd0}`
|
||
concatenation. Symptom: per-sprite emit count was 8 instead of
|
||
32 (each sprite drew one row), and VRAM held the per-sprite R
|
||
component scattered across 32 consecutive 4-byte cells. Caught
|
||
by adding a per-emit observer that printed
|
||
`(addr, data, be, mask, color_q)` for the first 10 emits.
|
||
Fix: `y << 20` instead of `y << 32` in `bake.py`. **PASS after
|
||
the fix.**
|
||
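The bit placement behind this fix is easy to check in isolation. The sketch below mirrors the SystemVerilog concatenation quoted above in Python; the function names are hypothetical, and the 12-bit width of `x_int` / `y_int` is inferred from the bit ranges given in the text rather than taken from `bake.py`.

```python
def pack_xyz2(x_int, y_int):
    """Pack integer pixel coordinates like {32'd0, y_int, 4'd0, x_int, 4'd0}."""
    return ((y_int & 0xFFF) << 20) | ((x_int & 0xFFF) << 4)

def pack_xyz2_buggy(x_int, y_int):
    """The first-bake mistake: Y shifted to bits [43:32]."""
    return ((y_int & 0xFFF) << 32) | ((x_int & 0xFFF) << 4)

def y_field(value):
    """What a decoder reading bits [31:20] sees as the integer Y."""
    return (value >> 20) & 0xFFF
```

With the buggy packing a decoder reads Y as 0 for every vertex, which is consistent with each sprite collapsing to a single row of 8 emits.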
|
||
**What's still NOT in this chapter** (deferred to Ch147+):
|
||
- Real `.mem` bake tooling integration (currently
|
||
`bake.py` is run manually before sim; a Makefile target or
|
||
CI step that invokes it would belong in Ch147).
|
||
- Board-specific top: pin constraints, target FPGA family,
|
||
PHY shim (HDMI/DVI/VGA), reset-release sequencer.
|
||
- A multi-PSM top (the Ch142 PSMT4 TRXDIR variant would be a
|
||
natural second wrapper once the build flow is proven).
|
||
|
||
### Fixture bake flow (Ch147)
|
||
|
||
Ch147 makes the Ch146 `.mem` bake first-class so the static
|
||
fixtures can't drift from `bake.py`. Three new Makefile targets:
|
||
|
||
| Target | Purpose |
|
||
|-----------------------------------------|-----------------------------------------------------------------------|
|
||
| `top_psmct32_raster_demo_mem` | Re-runs `bake.py`; produces `bios.mem` + `payload.mem` atomically. |
|
||
| `top_psmct32_raster_demo_mem_check` | Verifies fixture sizes (bios.mem = 1024 lines, payload.mem = 256). |
|
||
| `tb_top_psmct32_raster_demo` (existing) | Now declares `top_psmct32_raster_demo_mem` as a prerequisite. |
|
||
|
||
The bake target uses Make's grouped-target syntax (`&:`) so a
|
||
single `bake.py` run produces both files atomically — they can
|
||
never be out-of-step.
|
||
|
||
The size-check target counts the data lines in each fixture (skipping blanks +
|
||
`// ...` comment-only lines) and asserts the exact expected
|
||
counts. A non-matching count exits with status 1, surfacing a
|
||
fixture/script drift as a hard build failure.
|
||
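The counting rule used by the size check can be illustrated as follows. How the Makefile target actually implements it is not shown in this document, so this is only a sketch of the stated rule with hypothetical function names.

```python
def count_data_lines(text):
    """Count lines that carry data, skipping blanks and '//' comment-only lines."""
    count = 0
    for line in text.splitlines():
        stripped = line.strip()
        if stripped and not stripped.startswith("//"):
            count += 1
    return count

def check_fixture(text, expected):
    """Return a process exit status: 0 on an exact match, 1 on drift."""
    return 0 if count_data_lines(text) == expected else 1

# Tiny stand-in fixture: one comment, one blank line, two data words.
sample = "// bios.mem\n\n3c08bfc0\n00000000\n"
```

A real run would pass the contents of `bios.mem` with `expected=1024` and `payload.mem` with `expected=256`.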
|
||
Deleting the fixtures and running the TB triggers the bake
|
||
automatically:
|
||
```
|
||
$ make tb_top_psmct32_raster_demo
|
||
=== bake top_psmct32_raster_demo .mem fixtures ===
|
||
python3 .../bake.py
|
||
[bake] wrote bios.mem (1024 words, 18 active) and payload.mem (256 qwords, 40 active)
|
||
=== build tb_top_psmct32_raster_demo ===
|
||
...
|
||
[tb_top_psmct32_raster_demo] PASS
|
||
```
|
||
|
||
#### Synthesis-facing macros
|
||
|
||
When pointing a synthesis tool at `rtl/top/top_psmct32_raster_demo.sv`,
|
||
two preprocessor defines must be set so `bios_rom_stub` and
|
||
`ee_ram_stub` find their `$readmemh` images. These are macros
|
||
(NOT module parameters) per the iverilog-12 string-parameter
|
||
forwarding workaround documented in the Ch146 wrapper banner;
|
||
they map cleanly to FPGA-tool defines.
|
||
|
||
| Macro | Value |
|
||
|----------------------------------------------------|----------------------------------------------------------------|
|
||
| `TOP_PSMCT32_RASTER_DEMO_BIOS_IMAGE_FILE` | Absolute (or tool-relative) path to `bios.mem` |
|
||
| `TOP_PSMCT32_RASTER_DEMO_PAYLOAD_IMAGE_FILE` | Absolute (or tool-relative) path to `payload.mem` |
|
||
|
||
Both default to `""` so the wrapper still elaborates without
|
||
fixtures (synthetic NOP-sled in `bios_rom_stub` + zero-init
|
||
`ee_ram_stub`, which produces no DMAC payload but a stable
|
||
PCRTC frame with `r=g=b=0`).
|
||
|
||
**Vivado** (preprocessor `verilog_define` on the synthesis +
|
||
implementation filesets — these are macros, not module
|
||
generics):
|
||
```
|
||
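# [list ...] with double quotes lets Tcl expand $path; a brace-quoted list would pass "$path" literally.
set_property verilog_define [list \
    "TOP_PSMCT32_RASTER_DEMO_BIOS_IMAGE_FILE=\"$path/bios.mem\"" \
    "TOP_PSMCT32_RASTER_DEMO_PAYLOAD_IMAGE_FILE=\"$path/payload.mem\"" \
] [get_filesets sources_1]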
|
||
```
|
||
Repeat for the implementation fileset if it diverges from
|
||
`sources_1`.
|
||
|
||
**Quartus** (project-level macro defines):
|
||
```
|
||
set_global_assignment -name VERILOG_MACRO \
|
||
"TOP_PSMCT32_RASTER_DEMO_BIOS_IMAGE_FILE=\"$path/bios.mem\""
|
||
set_global_assignment -name VERILOG_MACRO \
|
||
"TOP_PSMCT32_RASTER_DEMO_PAYLOAD_IMAGE_FILE=\"$path/payload.mem\""
|
||
```
|
||
|
||
**Iverilog (sim)**: the Ch147 Makefile passes them via `-D`
|
||
flags in the `tb_top_psmct32_raster_demo` build rule —
|
||
`-DTOP_PSMCT32_RASTER_DEMO_BIOS_IMAGE_FILE='"$(SIM_DIR)/data/...
|
||
/bios.mem"'` — and the `top_psmct32_raster_demo_mem`
|
||
prerequisite ensures the .mem files exist before the TB
|
||
elaborates.
|
||
|
||
### DE25-Nano synthesis scaffold (Ch148)
|
||
|
||
Ch148 makes the Ch146 hardware top synthesis-addressable on
|
||
DE25-Nano without committing to a video PHY shim or final pin
|
||
constraints (those land in Ch149+).
|
||
|
||
| File / target | Purpose |
|
||
|------------------------------------------------------------------|------------------------------------------------------------|
|
||
| `synth/de25_nano/top_psmct32_raster_demo/files.f` | RTL filelist — Ch123 dep tree only (~14 entries). |
|
||
| `synth/de25_nano/top_psmct32_raster_demo/README.md` | Top module + macros + fixtures + DE25-Nano clock/reset/video assumptions. |
|
||
| `make top_psmct32_raster_demo_synth_check` | Validates files.f paths + fixture presence. |
|
||
|
||
The synth-check target depends on `top_psmct32_raster_demo_mem_check`,
|
||
so a single command verifies fixture sizes AND that every file
|
||
referenced by the synth filelist exists. It exits non-zero on
|
||
any miss — surfacing both fixture drift (Ch147 size guard) and
|
||
filelist drift as hard build failures.
|
||
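A filelist existence check of the kind `top_psmct32_raster_demo_synth_check` performs could look like the sketch below. The `files.f` dialect is an assumption (one path per line, with blank lines, `//` or `#` comments, and `+` / `-` option lines ignored), and the function name is hypothetical; the real target may be implemented differently.

```python
import os
import tempfile

def missing_files(filelist_text, root):
    """Return filelist entries that do not exist under `root`."""
    missing = []
    for line in filelist_text.splitlines():
        entry = line.strip()
        # Skip blanks, comments, and tool option lines (assumed dialect).
        if not entry or entry[0] in "#+-" or entry.startswith("//"):
            continue
        if not os.path.isfile(os.path.join(root, entry)):
            missing.append(entry)
    return missing

# Demonstration against a throwaway directory holding only a.sv.
with tempfile.TemporaryDirectory() as root:
    open(os.path.join(root, "a.sv"), "w").close()
    demo_missing = missing_files("// demo\na.sv\nb.sv\n", root)
```

A non-empty result would map to the non-zero exit status described above.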
|
||
`.qsf` (Quartus pin assignments) is **not** committed in Ch148.
|
||
The README documents the board assumptions (clock domain,
|
||
reset polarity, `core_go` strategy, video-out path candidates,
|
||
LED status mapping) so the next chapter can author it without
|
||
inventing context. The point of Ch148 is that a Quartus project
|
||
import (or Vivado / `verilator --lint-only`) finds every file
|
||
the design needs, with the macros documented end-to-end.
|
||
|
||
### DE25-Nano board wrapper (Ch149)
|
||
|
||
Ch149 turns the Ch146 board-agnostic top into a real board top
|
||
without yet committing to pin assignments or a video PHY. New:
|
||
|
||
| Artifact | Purpose |
|
||
|-----------------------------------------------------------|------------------------------------------------------------------------|
|
||
| `rtl/top/de25_nano_psmct32_raster_demo_top.sv` | Board wrapper — DE25-Nano signal names + reset sequencer + LED status. |
|
||
| `sim/tb/top/tb_de25_nano_psmct32_raster_demo_top.sv` | Smoke TB exercising clock/reset/core_go/LED/video pins. |
|
||
|
||
**Top ports** (matching the Terasic Golden_top.v conventions
|
||
from the DE25-Nano resource CD): `CLOCK0_50` / `CLOCK1_50` /
|
||
`CLOCK2_50`, `KEY[1:0]` (active-LOW), `SW[3:0]`, `LED[7:0]`
|
||
(active-LOW), and raw `VIDEO_R/G/B/HSYNC/VSYNC/DE` outputs that
|
||
a future PHY shim will consume.
|
||
|
||
**Reset bridge**:
|
||
1. `ninit_done` sourced from Terasic's `reset_release` IP under
|
||
`\`ifdef USE_TERASIC_RESET_RELEASE_IP` (default-off; sim uses
|
||
an inline 16-cycle stub matching the IP's shape).
|
||
2. `KEY[0]` + `ninit_done` feed an async-assert/sync-deassert
|
||
2-stage shift register on CLOCK2_50. Mirrors the retroDE_nes
|
||
pattern at `retroDE_nes.sv:170-177`.
|
||
|
||
**`core_go` sequencer**: 16-cycle delay after `core_rst_n`
|
||
deasserts, then a one-cycle `core_go` pulse. Matches the
|
||
"recommended hardware path" documented in the Ch148 README and
|
||
the level-sensitive `go_i` semantics at `ee_core_stub.sv:812-813`.
|
||
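The sequencer behavior described above (wait 16 cycles after reset release, then pulse once) can be modelled cycle by cycle. This is a behavioral sketch with a hypothetical function name; the exact cycle on which the RTL raises `core_go` may differ by one from this model.

```python
def core_go_trace(rst_n_trace, go_delay=16):
    """Cycle model: after rst_n goes high, wait go_delay cycles, pulse core_go once."""
    count, fired, out = 0, False, []
    for rst_n in rst_n_trace:
        go = 0
        if not rst_n:
            count, fired = 0, False          # reset re-arms the sequencer
        elif not fired:
            if count == go_delay:
                go, fired = 1, True          # single one-cycle pulse
            else:
                count += 1
        out.append(go)
    return out

# 4 cycles in reset, then 40 cycles out of reset.
trace = core_go_trace([0] * 4 + [1] * 40)
```

The trace contains exactly one high cycle, which matches the `core_go_pulses=1` count the smoke TB reports.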
|
||
**LED status**: the Ch146 wrapper's three sticky status outputs
|
||
drive `LED[2:0]` (active-LOW): `LED[0] = ~core_halt`,
|
||
`LED[1] = ~dma_done_seen`, `LED[2] = ~frame_seen`. `LED[7:3]`
|
||
tied HIGH (OFF).
|
||
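The active-low LED mapping above reduces to a bitwise inversion of the three status bits. A minimal sketch, with a hypothetical helper name:

```python
def led_bus(core_halt, dma_done_seen, frame_seen):
    """Return the 8-bit active-low LED value for the three sticky status bits."""
    status = (frame_seen << 2) | (dma_done_seen << 1) | core_halt
    # Inverting sets LED[7:3] high (off) and drives a low on each set status bit.
    return (~status) & 0xFF
```

For instance, `led_bus(1, 1, 1)` gives `0xF8`: the three status LEDs lit and `LED[7:3]` off.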
|
||
**Smoke TB counts**: `core_go_pulses=1`, all three status LEDs
|
||
eventually latch (the actual fall-edge order is `frame_seen`
|
||
first, then `core_halt`, then `dma_done_seen` — `frame_seen`
|
||
is a "PCRTC alive" indicator that fires on the first empty
|
||
frame after reset, well before the bootlet runs), and
|
||
`VIDEO_DE` rises inside the active region. Standalone PASS.
|
||
|
||
`.qsf` (pin assignments), PLL, and video PHY shim remain
|
||
deferred (Ch150+). Ch149 makes the design board-shaped, not
|
||
yet board-pinned.
|
||
|
||
### Quartus scaffold for DE25-Nano (Ch150)
|
||
|
||
Ch150 commits the first real Quartus artifacts for the Ch149
|
||
board wrapper — a minimal `.qsf` + `.sdc` pair, deliberately
|
||
PHY-light:
|
||
|
||
| File | Purpose |
|
||
|-----------------------------------------------------------------|-------------------------------------------------------------------|
|
||
| `synth/de25_nano/top_psmct32_raster_demo/de25_nano_psmct32_raster_demo_top.qsf` | Device + family + pin assignments + IO standards + .mem macros + file list. |
|
||
| `synth/de25_nano/top_psmct32_raster_demo/de25_nano_psmct32_raster_demo_top.sdc` | CLOCK2_50 50 MHz clock + reset-sync false-path + IO false-paths. |
|
||
| `make top_psmct32_raster_demo_quartus_scaffold_check` | Validates both files exist + top entity + pins + clock period. |
|
||
|
||
**Device** (sourced from `retroDE_splash.qsf`): Agilex 5
|
||
`A5EB013BB23BE4SCS`, package `VPBGA`. **Top entity**:
|
||
`de25_nano_psmct32_raster_demo_top` (the Ch149 board wrapper —
|
||
NOT the inner Ch146 module). **Pin assignments** match the
|
||
DE25-Nano board pinout used by `retroDE_splash` and
|
||
`retroDE_nes`: `CLOCK2_50` → `PIN_BF23`, `KEY[0]` → `PIN_C8`,
|
||
`LED[2:0]` → `PIN_DN22 / PIN_DJ32 / PIN_DF35`. CLOCK0/1_50,
|
||
KEY[1], SW[3:0], and LED[7:3] are also assigned (their canonical
|
||
pins) so Quartus doesn't flag them as unconstrained inputs/
|
||
outputs even though the Ch149 wrapper ties them off.
|
||
|
||
**SDC** (sourced from `retroDE_splash.sdc`): a single 50 MHz
|
||
`create_clock` on CLOCK2_50, the standard reset-sync first-stage
|
||
false-path (`set_false_path -to [get_registers -nowarn
|
||
{*rst_sync[0]}]`), and IO false paths for `KEY[*]`, `SW[*]`,
|
||
`LED[*]` plus the as-yet-unpinned `VIDEO_*` outputs (replaced
|
||
by real `set_output_delay` constraints when the PHY shim
|
||
lands).
|
||
|
||
**`.mem` macros** baked into the QSF (project-relative paths):
|
||
`TOP_PSMCT32_RASTER_DEMO_BIOS_IMAGE_FILE = sim/data/top_psmct32_raster_demo/bios.mem`
|
||
and the matching payload macro. Run `make -C sim
|
||
top_psmct32_raster_demo_mem` before launching Quartus.
|
||
|
||
**`USE_TERASIC_RESET_RELEASE_IP`** is **not** defined in this
|
||
QSF — keeping the wrapper self-contained for the first project
|
||
import. To wire in Terasic's `reset_release` IP, define the
|
||
macro and add the IP file from
|
||
`DE25_Nano_ResourceCD/Demonstration/FPGA/Board_Info_RTL/reset_release/`.
|
||
|
||
**Deferred to Ch151+**: video PHY pins + shim (HDMI ADV7513 +
|
||
I²C config FSM, VGA DAC, or PMOD), PLL `.ip` config, LPDDR4 /
|
||
SDRAM / HPS / CAM / UART / GPIO assignments. Ch150 makes the
|
||
project Quartus-importable, not yet Quartus-buildable for video
|
||
output.
|
||
|
||
### PLL + lock-gated reset (Ch151)
|
||
|
||
Ch151 adds the most conservative hardware bring-up step before
|
||
touching the video PHY: a board-clock PLL on the path between
|
||
`CLOCK2_50` and the design clock, with the reset bridge gated
|
||
on PLL lock so the design can only leave reset once the PLL is
|
||
stable.
|
||
|
||
| Artifact | Purpose |
|----------|---------|
| `rtl/top/de25_nano_pll_stub.sv` | Sim stub matching the Quartus IOPLL `pll` module signature. |
| `rtl/top/de25_nano_psmct32_raster_demo_top.sv` (Ch151) | Reworked with PLL instantiation + lock-gated reset bridge + `design_clk` distribution to the Ch146 wrapper and `core_go` sequencer. |
| `tb_de25_nano_psmct32_raster_demo_top` (Ch151 update) | Adds rising-edge timestamps for `pll_locked` / `core_rst_n` / `core_go` and asserts the contract `pll_locked < core_rst_n < core_go`. |
|
||
|
||
**PLL signature** (matches `retroDE_nes/ip/pll/pll_bb.v` and
|
||
`retroDE_splash/ip/sys_pll/sys_pll_bb.v`):
|
||
```
module pll (
    input  wire refclk,
    input  wire rst,
    output wire outclk_0,
    output wire locked
);
```
|
||
|
||
**Sim stub behavior**: `outclk_0 = refclk` (pass-through, no
|
||
multiplication — sim doesn't need a different frequency, and a
|
||
pass-through still exercises the lock-gated reset bridge).
|
||
`locked` rises after 32 cycles with `rst` low; held LOW while
|
||
`rst` is HIGH.
|
||
|
||
**Reset gating**: the board top's `rst_sync` register
|
||
async-asserts on `(ninit_done | ~pll_locked)` — both FPGA init
|
||
AND PLL lock must complete before reset can deassert.
|
||
|
||
**Synth swap**: define `USE_PLL_IP` and add a Quartus IOPLL
|
||
`.qip` to the project; the board wrapper's `` `ifdef USE_PLL_IP ``
|
||
swaps the stub for the real IP. The QSF documents the swap
|
||
mechanism but ships with the IP commented out, keeping the
|
||
scaffold self-contained until the PLL chapter (Ch152+) commits
|
||
a frequency choice + IP file.
|
||
|
||
**TB contract** (smoke output): `t_pll/rstn/go=(950000,990000,1330000)`
in ps: the PLL locks at 950 ns, reset deasserts 40 ns later
(propagation through the 2-stage sync register), and `core_go`
fires 340 ns after that (the GO_DELAY=16 wait). The order
assertions catch any future regression of the gating.
|
||
|
||
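The ordering contract above can be sketched as a cycle-level behavioral model. This is a minimal Python sketch, not the project's RTL or testbench: the function name `bringup_timeline` and its exact cycle accounting are assumptions for illustration, `ninit_done` is assumed to be already low, and the model does not try to reproduce the absolute timestamps from the smoke output.

```python
def bringup_timeline(lock_cycles=32, sync_stages=2, go_delay=16):
    """Return (t_pll, t_rstn, t_go) in clock cycles for a simplified model.

    Hypothetical model: PLL stub lock counter -> lock-gated reset
    synchronizer -> core_go delay counter.
    """
    locked = False
    lock_count = 0
    sync = [0] * sync_stages          # 0 = reset asserted
    go_count = 0
    t_pll = t_rstn = t_go = None
    cycle = 0
    while t_go is None:
        cycle += 1
        # Reset bridge: held in reset while unlocked, otherwise shift in a 1.
        sync = ([1] + sync[:-1]) if locked else [0] * sync_stages
        core_rst_n = bool(sync[-1])
        if core_rst_n and t_rstn is None:
            t_rstn = cycle
        # core_go sequencer: counts cycles out of reset, fires at go_delay.
        if core_rst_n:
            go_count += 1
            if go_count == go_delay:
                t_go = cycle
        # PLL stub: `locked` rises after lock_cycles cycles with rst low.
        if not locked:
            lock_count += 1
            if lock_count == lock_cycles:
                locked = True
                t_pll = cycle
    return t_pll, t_rstn, t_go


t_pll, t_rstn, t_go = bringup_timeline()
assert t_pll < t_rstn < t_go, "gating order violated"
```

With the default parameters the reset release trails PLL lock by exactly the synchronizer depth, which is the same shape as the 40 ns gap (two 20 ns cycles of a 50 MHz clock) in the smoke output.
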
**Deferred to Ch152+**: real PLL output frequency tuning (the
|
||
stub passes refclk through; a real build sets `outclk_0` to
|
||
whatever the video PHY chapter needs), committing the actual
|
||
IOPLL `.ip` file under `synth/de25_nano/.../ip/`, the video
|
||
PHY shim itself.
|
||
|
||
### First Quartus compile + baseline report (Ch152)
|
||
|
||
Ch152 is the chapter where the toolchain is finally asked the
|
||
honest question: "does this DE25-Nano board top synthesize, fit,
|
||
and pass static timing analysis?"
|
||
|
||
**Driver**: [`synth/de25_nano/top_psmct32_raster_demo/build_quartus.sh`](../../synth/de25_nano/top_psmct32_raster_demo/build_quartus.sh)
|
||
runs `quartus_syn → quartus_fit → quartus_sta` against the Ch150
|
||
QSF + Ch151 PLL stub. `quartus_asm` (bitstream gen) is
|
||
deliberately skipped — Ch152 is a compile-and-report smoke,
|
||
not a deploy path. `USE_PLL_IP` is left undefined so the Ch151
|
||
self-contained PLL stub stays under test (per Codex framing).
|
||
|
||
**Make targets**:
|
||
| Target | Action |
|--------|--------|
| `make quartus_compile` | Full syn + fit + sta flow. |
| `make quartus_compile_clean` | Wipe outputs first, then full flow. |
| `make quartus_syn_only` | Synthesis only (~14 min smoke). |
| `make quartus_compile_report` | Run [`parse_reports.py`](../../synth/de25_nano/top_psmct32_raster_demo/parse_reports.py) on the latest output. |
|
||
|
||
**Ch152 RTL fixes that landed before synthesis would even
|
||
elaborate**:
|
||
|
||
| Issue | Fix |
|-------|-----|
| QSF line-continuation (`\`) parse error in `set_global_assignment -name VERILOG_MACRO` | Collapsed each assignment onto a single line. |
| `vram_stub.mem` 8192-iter init loop exceeded Quartus's 5000-iter synthesizable-loop limit (Error 13356) | Wrapped initial block in `// synthesis translate_off` / `_on` pragmas. Real Altera/Intel BRAM is power-on-zero so the procedural loop is sim-only. |
| `gs_pcrtc_stub` / `gif_image_xfer_stub` / `gs_stub` unconditionally instantiate all four swizzle math primitives even when their gate is 0 | Added `gs_swizzle_psmct16/8/4_stub.sv` to the synth filelist + QSF (iverilog trimmed the unused instances silently; Quartus errors). |
| `gs_stub.interp_byte` (Ch86 Gouraud TRI math) 64-bit signed divide hits Quartus Pro's lpm_divide LPM_WIDTHN ≤64 limit (Error 272006) | Wrapped divide in `// synthesis translate_off`; default fallback returns 0. The Ch123 SPRITE-only demo doesn't exercise Gouraud TRIs, so this is dead code in the build. A future Gouraud-TRI hardware demo would need a divider redesign sized for Agilex 5. |
| QSF `SDC_FILE` referenced via repo-root-relative path failed when the build script ran Quartus from a per-build work dir (Warning 16124) | Changed to basename-only; it works from either the repo root or the work dir (the script symlinks the SDC alongside the QSF). |
|
||
|
||
**First successful synthesis**: 0 errors, 3 warnings, 14:08
|
||
elapsed. 160 RAM segments + 26 DSP elements inferred.
|
||
|
||
**Fitter result — design too large for the part (the chapter's
|
||
honest answer)**:
|
||
|
||
```
Total dedicated logic registers : 121,176
Total pins                      : 17 / 351 ( 5 %)
Total block memory bits         : 65,536 / 7,331,840 (<1 %)
Total RAM Blocks                : 6 / 358 ( 2 %)
Total DSP Blocks                : 20 / 188 (11 %)
Logic utilization (ALMs needed) : 155,104 / 46,800 (331 %)
```
|
||
|
||
The design needs **155,104 ALMs vs the part's 46,800 — 3.31×
|
||
oversized**. `Error (170011): Design contains 260,263 blocks of
|
||
type combinational node. However, the device contains only
|
||
93,600 blocks.`
|
||
|
||
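The oversubscription figures quoted above follow directly from the fitter numbers. A small Python check, using only values that appear in the report and the error message (the helper name `oversubscription` is illustrative):

```python
def oversubscription(needed, available):
    """Ratio of requested resources to what the device provides."""
    return needed / available


alm_ratio = oversubscription(155_104, 46_800)    # ALMs needed vs ALMs on the part
comb_ratio = oversubscription(260_263, 93_600)   # Error 170011 figures

print(f"ALMs: {alm_ratio:.2f}x, combinational nodes: {comb_ratio:.2f}x")
```

Note that 93,600 is exactly twice 46,800, which would be consistent with the fitter counting two combinational nodes per ALM; that reading is an inference from the numbers, not something the log states.
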
**Why so big** (the precise picture, to be drilled into by Ch153+):
|
||
|
||
The synthesis log reports `Info (22567): extracting RAM` for
|
||
**all four** memory identifiers — `ee_ram_stub.mem`,
|
||
`bios_rom_stub.mem`, `ee_memory_map_stub.useg_shadow_mem`, and
|
||
`vram_stub.mem` — so Quartus *did* recognize each as a memory
|
||
structure at syn time. But the fit report shows only **65,536
|
||
bits / 6 RAM Blocks** committed (roughly enough for BIOS 4 KB +
|
||
EE-RAM 4 KB). Something between syn and fit caused the larger
|
||
arrays — most likely `vram_stub.mem` (8 KB) and possibly
|
||
`useg_shadow_mem` (4 KB after Ch145's 1024-word shrink) — to
|
||
either (a) be replicated into combinational mux/decoder logic
|
||
because of their access-port shape, or (b) lose their RAM
|
||
attribute during fitter optimization and fall back to
|
||
flip-flop implementation. The 121,176 dedicated registers + the
|
||
260,263 combinational nodes are consistent with at least
|
||
`u_vram` getting massively unrolled.
|
||
|
||
Ch153's job is to isolate **which array(s)** and **which port
|
||
shape(s)** prevent compact block-RAM implementation. The
|
||
likely candidates: `vram_stub`'s dual read ports + per-byte
|
||
write_be lane (Ch95's per-byte gate may not be RAM-block-
|
||
friendly on Agilex 5), and the EE memory map's wide arbitration
|
||
into the useg-shadow port. None of this is fixed in Ch152 —
|
||
surfacing the gap precisely is the chapter's deliverable.
|
||
|
||
**Other notable findings** (full list in
|
||
[`output_files/build_logs/`](../../synth/de25_nano/top_psmct32_raster_demo/output_files/build_logs/)):
|
||
- **Critical Warning 20759**: "Use the Reset Release IP in
|
||
Agilex 5 designs to ensure a successful configuration." This
|
||
is the Ch151 `` `ifdef USE_TERASIC_RESET_RELEASE_IP `` opt-in;
|
||
enabling it (and committing the IP file) is a Ch153+ task.
|
||
- **6× Warning 16749**: identifiers used before declaration in
|
||
`dmac_reg_stub`, `gif_packed_stub`, `gs_stub`,
|
||
`gif_image_xfer_stub`. Style/lint warnings, no functional
|
||
impact; clean-up candidate for a future polish chapter.
|
||
- **STA never ran** because fit failed.
|
||
|
||
**What Ch152 leaves for Ch153+**:
|
||
- Resource reduction. Most likely candidates: BRAM-infer
|
||
`vram_stub.mem` and `useg_shadow_mem` cleanly (Quartus
|
||
attribute hints / restructure read ports), or shrink the EE
|
||
core's MIPS decode (table-driven vs LUT-driven), or move to
|
||
a larger Agilex 5 part if available.
|
||
- Enabling `USE_TERASIC_RESET_RELEASE_IP` and committing the
|
||
Terasic `reset_release` IP file.
|
||
- The PHY shim chapter (`VIDEO_*` virtualized → real HDMI
|
||
ADV7513 / VGA / PMOD pins).
|
||
- Cleaning up the 6× forward-reference style warnings.
|
||
|
||
### Memory-shape forensics (Ch153)
|
||
|
||
Ch153 is a memory-forensics chapter (NOT a rewrite chapter): two
|
||
isolated tiny Quartus projects under [`synth/de25_nano/experiments/`](../../synth/de25_nano/experiments/)
|
||
target the same Agilex 5 part as the Ch150 board top so resource
|
||
deltas are apples-to-apples. The goal is to identify which feature(s)
|
||
of `vram_stub`'s shape prevent compact block-RAM implementation and
|
||
drive the Ch152 size deficit.
|
||
|
||
| Experiment | Memory shape |
|------------|--------------|
| `exp_a_bram_friendly` | 2048 × 32-bit, single port, sync read + sync write with byte-WE. Intel-friendly BRAM template. |
| `exp_b_vram_shape` | 8192 × 8-bit, dual COMBINATIONAL read, byte-WE + per-bit mask RMW. Exact `vram_stub` shape. |
|
||
|
||
**The result is decisive**:
|
||
|
||
| Metric | exp_a (BRAM-friendly) | exp_b (vram_stub-shape) |
|--------|-----------------------|-------------------------|
| Fitter status | ✅ **Successful** | ❌ **Failed** |
| Logic utilization (ALMs) | **46** / 46,800 (< 1 %) | (fit failed — placement reports 257,986 combinational nodes vs 93,600 device max) |
| Total dedicated logic registers | **0** | **65,536** |
| Total RAM Blocks | **4** / 358 (1 %) | **0** / 358 (0 %) |
| Total block memory bits | **65,536** (8 KB) | **0** |
|
||
|
||
**Interpretation**:
|
||
- The Intel-friendly shape maps the same 8 KB to **4 RAM Blocks**
  with only **46 ALMs and zero dedicated logic registers**.
- The `vram_stub` shape maps the same 8 KB to **zero RAM Blocks**,
  **65,536 dedicated registers** (one flip-flop per bit: 8192 bytes
  × 8 bits), and **257,986 combinational nodes** (the 4-byte
  concatenation multiplexers for the dual combinational reads plus
  the per-bit mask RMW gates).
|
||
- The 257,986 combinational-node figure for a single 8 KB memory
|
||
almost exactly matches the 260,263 combinational-node figure
|
||
Ch152 reported for the **entire top-wrapper design** —
|
||
empirical confirmation that `u_vram` alone accounts for
|
||
essentially all of the Ch152 size deficit.
|
||
|
||
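The two numeric claims in this interpretation can be checked with a few lines of Python. The constants are taken from the tables above; the variable names are illustrative only.

```python
DEPTH_BYTES = 8192        # vram_stub depth (8192 x 8-bit)
BITS_PER_BYTE = 8

# A memory that falls out of block RAM needs one flip-flop per stored bit.
storage_bits = DEPTH_BYTES * BITS_PER_BYTE

exp_b_comb_nodes = 257_986   # isolated vram_stub-shape experiment
ch152_comb_nodes = 260_263   # entire Ch152 top-wrapper design
share = exp_b_comb_nodes / ch152_comb_nodes

print(storage_bits)          # matches the 65,536 dedicated registers in exp_b
print(f"{share:.1%}")        # fraction of the Ch152 node count explained by u_vram
```
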
**Which feature is the dominant cost** (the four candidates the
shape diff folds together):
|
||
|
||
The exp_a vs exp_b diff folds four feature changes together
|
||
(byte-addressable storage, combinational reads, dual reads,
|
||
per-bit mask RMW). To pin down which feature(s) dominate, a
|
||
future chapter could insert intermediate experiments — but the
|
||
exp_a result already gives the upper bound on what BRAM-native
|
||
inference can buy: ~4 RAM blocks + ~50 ALMs for 8 KB. Anything
|
||
that gets `vram_stub` close to that bar wins back the entire
|
||
Ch152 fit headroom.
|
||
|
||
The most likely individual culprit is the **per-bit mask RMW**:
|
||
Agilex 5's M20K BRAM has byte-WE primitives but does NOT have
|
||
per-bit RMW. Quartus has to materialize the
|
||
`(mem & ~mask) | (data & mask)` arithmetic outside the BRAM,
|
||
which forces the storage out of BRAM and into per-bit flip-flops.
|
||
Combinational reads are the second most likely (BRAMs are
|
||
synchronous-read-only on Agilex 5; Quartus has to either insert
|
||
a register on the read path or materialize the storage as
|
||
discrete flip-flops to feed the comb output).
|
||
|
||
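To make the byte-WE versus per-bit-mask distinction concrete, here is a minimal Python sketch. It assumes a 32-bit word with byte lane 0 in bits `[7:0]`; the helper names `rmw_write` and `mask_to_byte_enable` are hypothetical and do not correspond to anything in the RTL. A mask is expressible as a plain byte-enable only when every byte of it is all-ones or all-zeros; a nibble mask such as `0x0F` is not.

```python
def rmw_write(old_word, data, mask):
    """Per-bit read-modify-write: (mem & ~mask) | (data & mask) on 32 bits."""
    return (old_word & ~mask & 0xFFFFFFFF) | (data & mask)


def mask_to_byte_enable(mask):
    """Return the 4-bit byte enable equivalent to `mask`, or None if the
    mask is not byte-granular (and therefore needs a real per-bit RMW)."""
    be = 0
    for lane in range(4):
        lane_mask = (mask >> (8 * lane)) & 0xFF
        if lane_mask == 0xFF:
            be |= 1 << lane
        elif lane_mask != 0x00:
            return None          # partial byte: byte-WE alone cannot express it
    return be


print(mask_to_byte_enable(0x0000FF00))   # byte lane 1 only
print(mask_to_byte_enable(0x0000000F))   # nibble mask: not byte-granular
print(hex(rmw_write(0xDEADBEEF, 0x00000005, 0x0000000F)))
```

The second call returning `None` is the PSMT4 case: the nibble splice has to happen somewhere outside the memory primitive.
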
**Make targets**:
|
||
|
||
| Target | Action |
|--------|--------|
| `make quartus_experiments` | Compile every `synth/.../experiments/exp_*` project. |
| `make quartus_experiments_clean` | Wipe outputs first, then compile. |
| `make quartus_experiments_report` | Side-by-side resource summary (no recompile). |
|
||
|
||
**What Ch153 leaves for Ch154+**:
|
||
- Refactor `vram_stub` into a BRAM-friendly shape: replace
|
||
combinational reads with sync (registered output) reads,
|
||
replace per-bit mask RMW with byte-WE-only writes (move the
|
||
per-pixel sub-byte merging logic into the writer module —
|
||
most likely `gs_stub.raster_pixel_emit` for the PSMT4 nibble
|
||
case), and switch to 32-bit word-addressable storage with
|
||
byte-WE for the unaligned-byte case.
|
||
- Audit `useg_shadow_mem` next — it had `Info (22567): extracting RAM`
|
||
at synthesis but didn't survive to fit. Likely culprits there:
|
||
the `Ch64` / `Ch65` / `Ch70` mirror-write features that turn
|
||
the simple useg-shadow into a multi-port write structure.
|
||
|
||
### BRAM-friendly vram sibling (Ch154)
|
||
|
||
Ch154 adds a hardware-friendly sibling of `vram_stub` —
|
||
[`rtl/gif_gs/vram_bram_stub.sv`](../../rtl/gif_gs/vram_bram_stub.sv) — that maps cleanly onto Agilex 5
|
||
M20K block-RAM. Per Codex's framing, the chapter's blast radius
|
||
stays narrow: **add the sibling + prove it works + measure the
|
||
BRAM-inference win**. The actual swap of the board top to use
|
||
the new module + the writer-side PSMT4 nibble-RMW rework lands
|
||
in Ch155+.
|
||
|
||
**`vram_bram_stub` shape vs `vram_stub`**:
|
||
|
||
| Feature | `vram_stub` (legacy / sim reference) | `vram_bram_stub` (Ch154, hw-friendly) |
|---------|--------------------------------------|----------------------------------------|
| Storage | 8192 × 8-bit byte-addressable | 2048 × 32-bit word-addressable |
| Reads | Combinational; arbitrary alignment | Synchronous (1-cycle); word-aligned only |
| Read ports | 2 (combinational) | 2 (sync, true dual-port M20K) |
| Write granularity | byte-WE + per-bit `write_mask` RMW | byte-WE only |
| Per-bit mask RMW (Ch106) | yes — supports PSMT4 nibble splice | NO — caller must splice on writer side |
|
||
|
||
**New equivalence TB**: [`tb_vram_bram_stub_equivalence`](../../sim/tb/gif_gs/tb_vram_bram_stub_equivalence.sv).
|
||
Drives both DUTs in lockstep with byte-WE-only writes
|
||
(`write_mask = 0xFFFFFFFF` on the legacy module so the per-bit
|
||
RMW path is a no-op), aligns sample times across the new
|
||
module's 1-cycle sync-read latency, and asserts data
|
||
equivalence across:
|
||
- 32-bit word writes (`be=4'b1111`)
|
||
- per-byte-lane writes (`be=4'b0001 / 0010 / 0100 / 1000`)
|
||
- per-byte non-wrapping admission near MAX_BASE
|
||
- dual-port read agreement
|
||
|
||
PASS standalone + in the full sim regression.
|
||
|
||
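The lockstep idea behind the equivalence TB can be sketched as a pair of behavioral models. This is a hedged Python illustration, not a port of `tb_vram_bram_stub_equivalence`: the class names, the little-endian lane order (lane 0 at the lowest byte address), and the `lockstep` helper are all assumptions. It only covers word-aligned, byte-WE-only writes, which is the subset both contracts share.

```python
class ByteVram:
    """Legacy-style model: byte-addressable storage, combinational 32-bit read."""
    def __init__(self, size=8192):
        self.mem = bytearray(size)

    def write(self, addr, data, be):
        for lane in range(4):
            if (be >> lane) & 1:
                self.mem[addr + lane] = (data >> (8 * lane)) & 0xFF

    def read(self, addr):
        return int.from_bytes(self.mem[addr:addr + 4], "little")


class WordBram:
    """BRAM-style model: word storage, byte-WE writes, registered read."""
    def __init__(self, words=2048):
        self.mem = [0] * words
        self.read_data = 0

    def clock(self, read_addr, write=None):
        # The registered read samples the pre-write contents at this edge.
        self.read_data = self.mem[read_addr >> 2]
        if write is not None:
            addr, data, be = write
            for lane in range(4):
                if (be >> lane) & 1:
                    lane_bits = 0xFF << (8 * lane)
                    word = self.mem[addr >> 2]
                    self.mem[addr >> 2] = (word & ~lane_bits & 0xFFFFFFFF) | (data & lane_bits)


def lockstep(writes):
    """Apply the same writes to both models, then compare every written word."""
    legacy, bram = ByteVram(), WordBram()
    for addr, data, be in writes:
        legacy.write(addr, data, be)
        bram.clock(read_addr=addr, write=(addr, data, be))
    mismatches = []
    for addr, _, _ in writes:
        bram.clock(read_addr=addr)          # read data is valid after this edge
        if bram.read_data != legacy.read(addr):
            mismatches.append(addr)
    return mismatches


writes = [(0x0000, 0x11223344, 0b1111),
          (0x0004, 0x000000AA, 0b0001),
          (0x0004, 0x0000BB00, 0b0010),
          (0x1FFC, 0xCC000000, 0b1000)]
assert lockstep(writes) == []
```

The one structural difference the comparison has to absorb is the extra `clock()` call before sampling `read_data`, which mirrors the 1-cycle alignment the TB performs.
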
**Quartus experiment `exp_c_vram_bram_stub`** ([synth/.../experiments/exp_c_vram_bram_stub/](../../synth/de25_nano/experiments/exp_c_vram_bram_stub/))
|
||
proves the new module infers BRAM cleanly. Side-by-side with
|
||
the Ch153 baselines, all on the same Agilex 5 part:
|
||
|
||
| Experiment | Fit | ALMs | Registers | RAM Blocks | Block memory bits |
|------------|-----|------|-----------|------------|-------------------|
| `exp_a_bram_friendly` | ✅ Success | **46** / 46,800 | **0** | **4** / 358 | 65,536 |
| `exp_b_vram_shape` | ❌ Failed | (261,578 comb nodes vs 93,600 device max) | **65,536** | **0** / 358 | 0 |
| `exp_c_vram_bram_stub` | ✅ Success | **190** / 46,800 | **2** | **8** / 358 | 131,072 |
|
||
|
||
**Interpretation**:
|
||
- `exp_c` lands close to `exp_a`'s ideal (190 vs 46 ALMs; 8 vs
  4 RAM Blocks). The overhead vs `exp_a` comes from the dual
  read port (the storage is replicated across M20Ks so that two
  independent read addresses can be served simultaneously, hence
  2× the block memory bits) plus the Ch95 per-byte non-wrapping
  admission gate inherited from `vram_stub`.
- `exp_c` uses only **2** dedicated registers, versus the 32 that
  a reset on `read_data` would require; the canonical Quartus
  inference template demands no reset on the BRAM data register.
|
||
- vs `exp_b`'s **65,536 registers + 261,578 combinational nodes**,
|
||
swapping `vram_stub` → `vram_bram_stub` recovers essentially
|
||
all of the Ch152 ALM headroom on the vram side. Useg-shadow
|
||
is the next forensic target (likely similar shape).
|
||
|
||
**Inference template gotcha** (caught + fixed in this chapter):
|
||
the first cut of `vram_bram_stub` had a reset on `read_data`
|
||
inside the always_ff block AND an in-bounds gate guarding the
|
||
`mem` read. Quartus rejected BRAM inference with
|
||
`Info (276007): RAM logic ... uninferred due to asynchronous
|
||
read logic`. Fix: simplified the read path to the canonical
|
||
template (`always_ff @(posedge clk) read_data <= mem[idx];`)
|
||
and moved bounds + alignment checks to a parallel `read_valid`
|
||
pipeline. After the fix, synthesis reports `Implemented 64 RAM segments` instead of 0.
|
||
|
||
**Ch155+ surface — writer-side normalization for ALL sub-32-bit
|
||
PSMs, not just PSMT4**: `vram_bram_stub`'s contract is stricter
|
||
than `vram_stub`'s — `write_addr` MUST be word-aligned
|
||
(`write_addr[1:0] == 2'b00`), and the byte lane(s) being written
|
||
are selected via `write_be` with the payload pre-shifted into
|
||
the right byte lane(s) of `write_data[31:0]`. Today's writer-
|
||
side RTL emits at sub-word boundaries:
|
||
- **PSMCT16** raster + image-xfer write at halfword addresses
|
||
(`write_addr[1] == 1` for the high halfword) with `be=4'b0011`
|
||
or `4'b1100` and the 16-bit payload in `write_data[15:0]`.
|
||
- **PSMT8** raster + image-xfer write at byte addresses
|
||
(any `write_addr[1:0]`) with `be=4'b0001` and the 8-bit payload
|
||
in `write_data[7:0]`.
|
||
- **PSMT4** raster + image-xfer write at byte addresses with
|
||
`be=4'b0001` + per-bit `write_mask` 0x0F or 0xF0 to splice
|
||
one nibble.
|
||
- **PSMCT32** raster + image-xfer write at word addresses with
|
||
`be=4'b1111` + the full 32-bit payload — the ONLY PSM that
|
||
natively matches `vram_bram_stub`'s contract today.
|
||
|
||
If we swap the board top to `vram_bram_stub` without writer-side
normalization, **every CT16/T8/T4 write to a non-word-aligned
address silently drops** because `write_addr[1:0] != 0` fails
admission. So Ch155 must rework each writer to:
|
||
1. Mask `write_addr` down to its word base (`write_addr & ~32'd3`).
|
||
2. Shift the payload from its native byte lane into the
|
||
appropriate byte lane(s) of a 32-bit `write_data` based on
|
||
the original `write_addr[1:0]`.
|
||
3. Generate `write_be` with bits set only for the byte lanes
|
||
the original sub-word address actually targets.
|
||
4. **For PSMT4 specifically**: replace the per-bit `write_mask`
|
||
nibble splice with a writer-side read-modify-write — read
|
||
the existing byte first, splice the new nibble in, then
|
||
issue a normal byte-WE write. Adds ~1 cycle of latency per
|
||
nibble-write but that's well within the 16×8 demo budget.
|
||
|
||
The rework lands inside `gs_stub.raster_pixel_emit` (Ch95/Ch105/
|
||
Ch106 wrote the legacy paths) and `gif_image_xfer_stub`'s per-
|
||
PSM dispatch. A focused TB that drives sub-word writes through
|
||
the normalizer and asserts the resulting `vram_bram_stub` words
|
||
match the legacy `vram_stub` byte-/halfword-/nibble-level
|
||
state would be the cleanest proof.
|
||
|
||
**Other Ch155+ work**:
|
||
- Update scanout / debug TBs that sample VRAM via vram_stub's
|
||
combinational reads to handle the 1-cycle sync-read latency
|
||
(or keep them on `vram_stub` if they're sim-only).
|
||
- Swap the Ch146 board top to instantiate `vram_bram_stub`
|
||
AFTER the writer-side normalization lands. Rerun the full
|
||
Quartus compile and expect a dramatic ALM/register reduction.
|
||
- Audit `useg_shadow_mem` next — Ch64/Ch65/Ch70 mirror-write
|
||
features may make it multi-port-write-shaped.
|
||
|
||
### VRAM write normalizer + first BRAM integration (Ch155)
|
||
|
||
Ch155 lands the writer-side normalization layer that bridges
|
||
the contract gap between the legacy `vram_stub` (byte-addressed
|
||
sub-word writes + per-bit RMW) and the new `vram_bram_stub`
|
||
(word-aligned + byte-WE only). Per Codex's framing the chapter
|
||
keeps blast radius narrow: build the normalizer + verify it
|
||
standalone for all 4 PSMs + prove the easiest case (PSMCT32)
|
||
end-to-end through the new VRAM. RTL plumbing into
|
||
`gs_stub.raster_pixel_emit` and `gif_image_xfer_stub` lands in
|
||
Ch156+.
|
||
|
||
| Artifact | Purpose |
|----------|---------|
| `rtl/gif_gs/vram_normalize_pkg.sv` | Pure-comb `normalize_write` function — natural byte address + PSM + payload + (T4-only) old_byte → word-aligned write_addr + shifted write_data + write_be. |
| `tb_vram_normalize_write` | Focused unit TB — 17 cases across CT32 / CT16 / T8 / T4 lanes + misuse detection. |
| `rtl/top/top_psmct32_raster_demo_bram.sv` | Sibling of the Ch146 wrapper with `vram_bram_stub` swapped in. |
| `tb_top_psmct32_raster_demo_bram` | Integration TB — drives Ch146 fixtures + verifies VRAM contents at PSMCT32 swizzled addresses via hierarchical probe. |
|
||
|
||
**Function contract** (`vram_normalize_pkg::normalize_write`):
|
||
|
||
| PSM | byte_addr alignment | payload bits used | output `write_be` shape | extras |
|-----|---------------------|-------------------|-------------------------|--------|
| PSMCT32 | word (`addr[1:0]==0`) | `payload[31:0]` (full ABGR) | `4'b1111` | misuse → drop (`be=0000`) |
| PSMCT16 | halfword (`addr[0]==0`) | `payload[15:0]` (RGB5A1) | `4'b0011` (low) / `4'b1100` (high), keyed on `addr[1]` | misuse → drop |
| PSMT8 | byte (any) | `payload[7:0]` (index byte) | one of `4'b0001 / 0010 / 0100 / 1000`, keyed on `addr[1:0]` | — |
| PSMT4 | byte (any) | `payload[3:0]` (nibble) | one of `4'b0001 / 0010 / 0100 / 1000`, keyed on `addr[1:0]` | needs `old_byte` + `nibble_hi`; output is the spliced full byte at the addressed lane |
| any other | — | — | `4'b0000` | — |
|
||
|
||
**PSMT4 splice math** (the only PSM whose output depends on
|
||
prior memory state): given `nibble_hi=0`, the function returns
|
||
`new_byte = {old_byte[7:4], payload[3:0]}` — preserves the
|
||
upper nibble, replaces the lower. With `nibble_hi=1`,
|
||
`new_byte = {payload[3:0], old_byte[3:0]}`. The CALLER is
|
||
responsible for sourcing `old_byte` via a 1-cycle read of
|
||
`mem[byte_addr]` upstream of the write; the function itself is
|
||
purely combinational. The Ch156+ RTL plumbing chapter is
|
||
where that read pipeline lives inside
|
||
`gs_stub.raster_pixel_emit` and `gif_image_xfer_stub`.
|
||
|
||
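The function contract and splice math above can be captured in a small behavioral reference model. The Python below is a sketch under stated assumptions, not a transcription of `vram_normalize_pkg`: the argument order, the return tuple `(write_addr, write_data, write_be)`, and the constant names are illustrative. The PSM values are the standard GS pixel-format encodings; whether the package uses the same constants internally is assumed.

```python
# Standard GS pixel storage mode encodings (package constant names are assumed).
PSMCT32, PSMCT16, PSMT8, PSMT4 = 0x00, 0x02, 0x13, 0x14


def normalize_write(byte_addr, psm, payload, old_byte=0, nibble_hi=False):
    """Return (write_addr, write_data, write_be) for a word-aligned byte-WE write."""
    word_addr = byte_addr & ~3
    lane = byte_addr & 3
    if psm == PSMCT32:
        if lane != 0:
            return word_addr, 0, 0b0000            # misuse: unaligned, drop
        return word_addr, payload & 0xFFFFFFFF, 0b1111
    if psm == PSMCT16:
        if byte_addr & 1:
            return word_addr, 0, 0b0000            # misuse: odd address, drop
        # lane is 0 or 2 here, so the enable is 0b0011 or 0b1100
        return word_addr, (payload & 0xFFFF) << (8 * lane), 0b0011 << lane
    if psm == PSMT8:
        return word_addr, (payload & 0xFF) << (8 * lane), 1 << lane
    if psm == PSMT4:
        nib = payload & 0xF
        if nibble_hi:
            new_byte = (nib << 4) | (old_byte & 0x0F)   # {payload[3:0], old_byte[3:0]}
        else:
            new_byte = (old_byte & 0xF0) | nib          # {old_byte[7:4], payload[3:0]}
        return word_addr, new_byte << (8 * lane), 1 << lane
    return word_addr, 0, 0b0000                    # any other PSM: no write


# High halfword of the word at 0x100:
print(normalize_write(0x102, PSMCT16, 0x6155))
# High-nibble splice into byte lane 1, preserving the old low nibble 0xC:
print(normalize_write(0x41, PSMT4, 0xA, old_byte=0x5C, nibble_hi=True))
```

As in the contract, the caller supplies `old_byte`; the model has no memory of its own.
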
**`top_psmct32_raster_demo_bram` integration result**: the new
|
||
sibling wrapper substitutes `vram_bram_stub` for `vram_stub`,
|
||
drops `write_mask` wiring (CT32's `mask=0xFFFFFFFF` makes the
|
||
per-bit RMW path a no-op so dropping it is functionally
|
||
equivalent), and accepts the 1-cycle sync-read latency on
|
||
PCRTC's `vram_read_data` path (so PCRTC scanout is 1-pixel
|
||
shifted; the integration TB skips frame capture and verifies
|
||
VRAM content via direct hierarchical probe). All 128 pixel
|
||
words at canonical PSMCT32 swizzled addresses match expected
|
||
ABGR. Standalone PASS.
|
||
|
||
**Ch155 critical audit check**: `normalize_write`'s
|
||
function-level misuse handling pins the contract — passing an
|
||
unaligned `byte_addr` for CT32 OR CT16 returns `write_be=4'b0000`,
|
||
which `vram_bram_stub` then drops cleanly. Combined with
|
||
Codex's stance that "no sub-32-bit writer is allowed to hand
|
||
an unaligned address directly to vram_bram_stub", the Ch156+
|
||
plumbing chapter has a hard contract to verify against.
|
||
|
||
**Ch156+ surface**:
|
||
- Insert a 1-cycle byte-read pipeline upstream of the PSMT4
|
||
raster emit + image-xfer paths inside `gs_stub` and
|
||
`gif_image_xfer_stub`. The read returns `old_byte` for
|
||
`normalize_write`'s splice input.
|
||
- Apply `normalize_write` to all four PSM emit lanes inside
|
||
both writers.
|
||
- Add focused TBs for PSMCT16 / PSMT8 / PSMT4 paths analogous
|
||
to `tb_top_psmct32_raster_demo_bram` — each verifies the
|
||
swizzled VRAM contents under the new normalizer + bram_stub.
|
||
- Add a 1-cycle address-stage register inside
|
||
`gs_pcrtc_stub` so scanout consumers see a clean
|
||
combinational-look read (`addr` → `data` with the BRAM's
|
||
internal sync stage hidden).
|
||
- Once all four lanes pass, swap the Ch146 board top to use
|
||
`vram_bram_stub` directly (or retire `vram_stub` outright).
|
||
- Audit `useg_shadow_mem` next — the Ch64/Ch65/Ch70 mirror-
|
||
write features may make it multi-port-write-shaped, which
|
||
is its own forensic exercise.
|
||
|
||
### Writer-side normalize plumbing — CT16 + T8 (Ch156)
|
||
|
||
Ch156 plumbs the Ch155 `vram_normalize_pkg::normalize_write`
|
||
function into the BRAM-friendly path so PSMCT16 and PSMT8
|
||
raster emits land at the right `vram_bram_stub` byte lane. The
|
||
chapter intentionally keeps blast radius narrow — the function
|
||
is wired in at the **wrapper site** between the unmodified
|
||
writer engines (`gs_stub.raster_pixel_emit`) and
|
||
`vram_bram_stub`, so the legacy byte-addressable contract on
|
||
`gs_stub`'s raster emit ports stays exactly as Ch128 / Ch134 etc.
defined it. PSMT4 still requires the read-modify-write
|
||
pipeline and is deferred to Ch157+.
|
||
|
||
| File / target | Role |
| ------------- | ---- |
| `rtl/top/top_psmct32_raster_demo_bram.sv` | Wrapper updated: `raster_pixel_psm_q` exposed; `bitbltbuf_q[61:56]` provides the PSM during xfer; the muxed `(byte_addr, psm, payload)` triple is run through `vram_normalize_pkg::normalize_write` and the result feeds `vram_bram_stub`. CT32 path remains a passthrough; CT16/T8 paths now write to the right lane. |
| `tb_gs_raster_bram_psmct16` | Focused CT16 integration TB — 16×4 SPRITE at FBP=0/FBW=1, halfword 0x6155. Drives gs_stub#(PSMCT16_SWIZZLE=1) directly; verifies all 64 swizzled halfwords land in `u_vram.mem[byte_addr >> 2]` at the addr[1]-keyed lane; pins the linear-stride separator at byte 0x80 = zero. |
| `tb_gs_raster_bram_psmt8` | Focused PSMT8 integration TB — 16×8 SPRITE at FBP=0/FBW=2, byte index 0xA5. Drives gs_stub#(PSMT8_SWIZZLE=1) directly; verifies all 128 swizzled bytes land in `u_vram.mem[byte_addr >> 2]` at the addr[1:0]-keyed lane. |
|
||
|
||
**Why wrapper-site, not in-engine**: keeping `gs_stub` and
|
||
`gif_image_xfer_stub` byte-addressable preserves the contract
|
||
that every Ch128 / Ch134 / Ch140 swizzle TB (and the legacy
|
||
`vram_stub`) was written against. Ch156's only structural
|
||
change is that a top wrapper which targets `vram_bram_stub`
|
||
also runs `normalize_write` between the writer and VRAM. A
|
||
future chapter can promote the normalizer into the writer
|
||
engines once we've decided to retire `vram_stub`; until then
|
||
the function lives where it can be removed without changing
|
||
the writers.
|
||
|
||
**PSMT4 deferral — explicit hard-gate** (Ch156 audit Medium #1
|
||
fix; **superseded by Ch157**): when Ch156 closed, the wrapper
|
||
masked `write_en` off when the active PSM was PSMT4
|
||
(`vram_psmt4_block = (vram_psm_pre == PSM_PSMT4)`,
|
||
`vram_we_mux = vram_we_pre && !vram_psmt4_block`). Without that
|
||
gate, `normalize_write`'s PSMT4 branch returned a real one-byte
|
||
write spliced against `old_byte=0`, silently corrupting VRAM
|
||
on any T4 raster emit. The Ch156 focused TB
|
||
`tb_gs_raster_bram_psmt4_gate` drove a 16×4 PSMT4 SPRITE
|
||
through the wrapper-shape gate and asserted (1) raster_pixel_emit
|
||
pulses fired, (2) every pulse hit the gate (`blocked == emit`),
|
||
(3) VRAM stayed at sentinel 0xDEADBEEF — zero corruption.
|
||
**Ch157 retires both the gate and that TB**: the wrapper now
|
||
runs a real RMW pipeline (see "PSMT4 RMW pipeline" section
|
||
below) and supplies a live `old_byte` so the splice produces
|
||
correct bytes. The retired TB's coverage is replaced by
|
||
`tb_gs_raster_bram_psmt4`, which drives the same kind of PSMT4
|
||
SPRITE but verifies *correct* nibble splices instead of
|
||
*absence* of writes.
|
||
|
||
**Adversarial coverage on the CT16 / PSMT8 TBs** (Ch156 audit
|
||
Medium #2 fix): both TBs originally drove a single uniform
|
||
payload across the whole sprite, so a buggy normalizer that
|
||
wrote all four byte lanes (or duplicated payload, or stomped
|
||
neighboring lanes) could still leave every checked pixel
|
||
matching. The TBs now split the image into TWO half-width
|
||
SPRITEs with **distinct** payloads:
|
||
- `tb_gs_raster_bram_psmct16` drives `(0,0)..(7,3)` with
|
||
halfword 0x6155 (low halfword lane via PSMCT16 swizzle) and
|
||
`(8,0)..(15,3)` with halfword 0x9F8E (high halfword lane of
|
||
the same 32-bit words). Sentinel preload (0xDEADBEEF) on
|
||
every VRAM word before the drive plus a linear-stride
|
||
separator check at byte 0x80 (outside the swizzled set).
|
||
- `tb_gs_raster_bram_psmt8` drives `(0,0)..(7,7)` with byte
|
||
0xA5 (lanes {0,1}) and `(8,0)..(15,7)` with byte 0x5A
|
||
(lanes {2,3}). Same sentinel preload.
|
||
|
||
A normalizer that swaps lanes, sets be too wide, or fails to
|
||
preserve the other halfword/byte lane(s) of the shared word
|
||
now surfaces as a per-pixel mismatch.
|
||
|
||
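Why the split-payload drive matters can be shown with a toy model. The Python sketch below is illustrative only: `write_t8`, `read_t8`, and `check` are hypothetical helpers, the "buggy" normalizer is an invented failure mode (it replicates the byte across all four lanes), and linear addressing stands in for the real swizzle.

```python
def write_t8(mem, byte_addr, value, buggy=False):
    """Write one PSMT8 byte into a list of 32-bit words."""
    word, lane = byte_addr >> 2, byte_addr & 3
    if buggy:
        mem[word] = value * 0x01010101          # be too wide: stomps all lanes
    else:
        shift = 8 * lane
        mem[word] = (mem[word] & ~(0xFF << shift) & 0xFFFFFFFF) | (value << shift)


def read_t8(mem, byte_addr):
    return (mem[byte_addr >> 2] >> (8 * (byte_addr & 3))) & 0xFF


def check(payloads, buggy):
    """Drive one byte per address, then verify every byte reads back."""
    mem = [0xDEADBEEF] * ((len(payloads) + 3) // 4)   # sentinel preload
    for addr, value in enumerate(payloads):
        write_t8(mem, addr, value, buggy)
    return all(read_t8(mem, addr) == value for addr, value in enumerate(payloads))


uniform = [0xA5] * 16                      # single payload across the sprite
split = [0xA5, 0xA5, 0x5A, 0x5A] * 4       # lanes {0,1} vs lanes {2,3}

print(check(uniform, buggy=True))   # the lane-stomping bug goes unnoticed
print(check(split, buggy=True))     # distinct per-lane payloads expose it
```

With a uniform payload every lane holds the same value regardless of which lanes were actually enabled, so the check cannot distinguish a correct one-hot `write_be` from an over-wide one.
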
**Sim regression**: 141 PASS / 0 FAIL after the audit fixes
|
||
(140 + the new `tb_gs_raster_bram_psmt4_gate`).
|
||
|
||
**xfer-side coverage**: `gif_image_xfer_stub` already feeds
|
||
the wrapper's pre-normalize mux during `xfer_busy`. CT32
|
||
TRXDIR uploads (no Ch156 TB exists yet, but the path is
|
||
wired) pass through the normalizer cleanly because xfer
|
||
emits CT32 word-aligned. CT16 + T8 xfer TBs that exercise
|
||
this path are a follow-on item — the wiring is already in
|
||
place; only a focused TB is missing.
|
||
|
||
**Sim regression (at Ch156 close, before the audit fixes)**:
140 PASS / 0 FAIL (138 + 2 new BRAM-integration TBs).
|
||
|
||
### PSMT4 RMW pipeline — `vram_bram_stub` writes enabled (Ch157)
|
||
|
||
Ch157 closes the last writer-PSM gap that Ch156 left behind: the
|
||
PSMT4 hard-gate is replaced by a wrapper-site read-modify-write
|
||
pipeline that supplies a LIVE `old_byte` from VRAM, splices the
|
||
new nibble against it, and commits a full-byte write through
|
||
`vram_bram_stub`'s byte-WE (no per-bit RMW required). The
|
||
nibble splice itself uses the SAME math as `vram_normalize_pkg`'s
|
||
PSMT4 branch (`new = nibble_hi ? {nib, old[3:0]} : {old[7:4], nib}`)
|
||
but lives **inline in the wrapper**, not inside a call to
|
||
`normalize_write` — the function is pure-comb and would have
|
||
required `old_byte` to be combinationally available, whereas
|
||
`vram_bram_stub`'s registered read port hands the byte back one
|
||
cycle later. The CT32/CT16/T8 paths still call `normalize_write`
|
||
directly (same-cycle, no read-back required). Goal Codex framed:
|
||
"all writer PSMs safe before swapping the board top."
|
||
|
||
**Pipeline shape** (inside
|
||
[`rtl/top/top_psmct32_raster_demo_bram.sv`](../../rtl/top/top_psmct32_raster_demo_bram.sv)):
|
||
|
||
```
emit cycle N:        is_t4_emit=1; vram_read2_addr = byte_addr & ~3;
                     pipe_q <= (byte_addr, nibble_hi, nibble[3:0]).
posedge → cycle N+1: vram_read2_data = mem[byte_addr] (sync read);
                     splice new_byte = nibble_hi
                                         ? {nibble, old[3:0]}
                                         : {old[7:4], nibble};
                     drive vram_we_final=1, write_addr=byte_addr&~3,
                     write_data shifted to byte_addr[1:0] lane,
                     write_be one-hot to that lane.
posedge → cycle N+2: mem[byte_addr] commits new_byte.
```
|
||
|
||
`old_byte` is sourced from the lane-correct slice of
|
||
`vram_read2_data`. CT32/CT16/T8 emits skip the pipe entirely and
|
||
fall through `vram_norm` same-cycle (CT32 stays a passthrough,
|
||
existing TBs unaffected).
|
||
|
||
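The splice and lane placement described above can be sketched as a small Python reference model. This is an illustrative sketch, not the wrapper's RTL: the function names `splice_nibble` and `t4_write_request` are hypothetical, and the model assumes a 32-bit word with byte lane `byte_addr[1:0]`, as the pipeline diagram states.

```python
def splice_nibble(old_byte: int, nibble: int, nibble_hi: bool) -> int:
    """new = nibble_hi ? {nib, old[3:0]} : {old[7:4], nib}"""
    nibble &= 0xF
    if nibble_hi:
        return (nibble << 4) | (old_byte & 0x0F)
    return (old_byte & 0xF0) | nibble


def t4_write_request(byte_addr: int, read_word: int, nibble: int, nibble_hi: bool):
    """Return (write_addr, write_data, write_be) for one PSMT4 read-modify-write."""
    lane = byte_addr & 0x3
    old_byte = (read_word >> (8 * lane)) & 0xFF  # lane-correct slice of the read word
    new_byte = splice_nibble(old_byte, nibble, nibble_hi)
    write_addr = byte_addr & ~0x3                # word-aligned write address
    write_data = new_byte << (8 * lane)          # new byte shifted into its lane
    write_be = 1 << lane                         # one-hot byte enable
    return write_addr, write_data, write_be
```

Because only one byte enable is set, the other three bytes of the word are untouched, which is why a full-byte write is enough and no per-bit RMW is needed.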
**Forwarding hazard — back-to-back same-byte writes**: a PSMT4
|
||
SPRITE rasters adjacent pixels at `x=2k` and `x=2k+1` to the
|
||
SAME `byte_addr` (low + high nibble of a single byte). At cycle
|
||
N+1 the wrapper reads `mem[byte_addr]` for emit-2 in the SAME
|
||
posedge that emit-1's write commits. NBA semantics inside
|
||
`vram_bram_stub` (separate `always_ff` blocks for the write port
|
||
and the read port) make the read see the PRE-write value, so
|
||
emit-2 would splice against stale data. The Ch157 pipe carries
|
||
a 1-deep `t4_prev_*` register set (addr + new_byte from the
|
||
just-completed RMW) and forwards `t4_prev_new_byte_q` whenever
|
||
the in-flight emit's `byte_addr` matches the previous emit's
|
||
`byte_addr`. The forwarding chain extends across any number of
|
||
back-to-back same-byte emits — emit-N reads emit-(N-1)'s
|
||
`new_byte` from the forward register, splices on top, and
|
||
emit-(N+1) reads emit-N's new_byte from that same register.
|
||
|
||
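The hazard can be reproduced in a cycle-level Python model. This is a simplified sketch under stated assumptions: memory is modelled per byte rather than per 32-bit word, the read port samples the pre-write value at each clock edge (as the NBA description above says), and the names `run_t4_pipe`, `pipe_q`, `read_q`, and `prev_q` are illustrative stand-ins for the wrapper's registers.

```python
def splice_nibble(old_byte, nibble, nibble_hi):
    if nibble_hi:
        return ((nibble & 0xF) << 4) | (old_byte & 0x0F)
    return (old_byte & 0xF0) | (nibble & 0xF)


def run_t4_pipe(initial_mem, emits, forward=True):
    """`emits` holds one (byte_addr, nibble_hi, nibble) tuple or None per cycle."""
    mem = dict(initial_mem)
    pipe_q = None   # emit captured at the previous clock edge
    read_q = 0      # registered read data belonging to pipe_q
    prev_q = None   # (byte_addr, new_byte) of the RMW that just completed
    for emit in list(emits) + [None]:  # one extra cycle drains the pipe
        write = None
        if pipe_q is not None:
            addr, nibble_hi, nibble = pipe_q
            old = read_q
            if forward and prev_q is not None and prev_q[0] == addr:
                old = prev_q[1]        # forward the byte that is still in flight
            write = (addr, splice_nibble(old, nibble, nibble_hi))
        # Clock edge: the read samples the pre-write value, then the write commits.
        if emit is not None:
            read_q = mem.get(emit[0], 0)
        if write is not None:
            mem[write[0]] = write[1]
        prev_q = write
        pipe_q = emit
    return mem
```

With a sentinel of `0xFF` and two back-to-back emits writing nibble `0xA` to the low and then the high half of the same byte, the model ends at `0xAA` with forwarding and at `0xAF` without it: the second splice loses the first nibble. Inserting one idle cycle between the emits removes the hazard even without forwarding.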
| File / target | Role |
| --- | --- |
| `rtl/top/top_psmct32_raster_demo_bram.sv` | Ch156 hard-gate replaced by the RMW pipe + forwarding registers; `vram_read2_addr` driven on T4 emit cycles; `vram_we_final` mux selects T4 pipe write or non-T4 same-cycle path. |
| `tb_gs_raster_bram_psmt4` | New positive-proof TB: drives a 16×4 LINEAR PSMT4 SPRITE (PSMT4_SWIZZLE=0 so adjacent x's hit the same byte) split into two halves with distinct nibbles (0xA / 0x5). 64 raster emits; verifies every byte under the sprite holds the expected pair of spliced nibbles (left half = 0xAA, right half = 0x55) plus sentinel preserved on bytes outside the sprite. **PASS**. |
| `tb_gs_raster_bram_psmt4_gate` | Retired: the gate it asserted no longer exists. |
|
||
|
||
**Why LINEAR PSMT4 in the new TB**: the linear address formula
|
||
`(y*FBW*32) + (x>>1)` puts adjacent x's at the same byte, which
|
||
is exactly the back-to-back same-byte forwarding hazard. The
|
||
swizzled path scatters bytes via `columnTable4`, so it touches
|
||
the forwarding logic less often. Linear coverage is strictly
|
||
stronger here.
|
||
|
||
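A quick way to see why linear addressing hits the hazard on every pixel pair is to evaluate the formula directly. The sketch below is illustrative; `psmt4_linear_addr` is a hypothetical helper, and the even-x-is-low-nibble mapping is an assumption taken from the "low + high nibble" description in the hazard paragraph.

```python
def psmt4_linear_addr(x, y, fbw):
    """(y*FBW*32) + (x>>1): FBW counts 64-pixel units, 32 bytes each at 4 bpp."""
    byte_addr = y * fbw * 32 + (x >> 1)
    nibble_hi = bool(x & 1)  # assumption: even x -> low nibble, odd x -> high nibble
    return byte_addr, nibble_hi


# Every (2k, 2k+1) pixel pair on a row lands in one byte.
pairs_share_byte = all(
    psmt4_linear_addr(2 * k, 3, 1)[0] == psmt4_linear_addr(2 * k + 1, 3, 1)[0]
    for k in range(8)
)
```

For a 16-pixel-wide sprite this means eight same-byte back-to-back pairs per row, so each row of the TB exercises the forwarding register eight times.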
**Non-T4 TB cleanup**: `tb_gs_raster_bram_psmct16` and
|
||
`tb_gs_raster_bram_psmt8` still mirror the *non-T4* portion of
|
||
the wrapper-site plumbing, but they no longer carry the Ch156
|
||
PSMT4 hard-gate (now removed in the wrapper). Both wire
|
||
`raster_pixel_emit` straight to `write_en` and let
|
||
`vram_norm` drive addr/data/be — focused TBs verifying their
|
||
own PSM lane. Full pipe coverage lives in `tb_gs_raster_bram_psmt4`
|
||
and the top wrapper TB.
|
||
|
||
**Sim regression**: 141 PASS / 0 FAIL after Ch157 (140 + new
|
||
`tb_gs_raster_bram_psmt4` − retired `tb_gs_raster_bram_psmt4_gate`).
|
||
|
||
### PCRTC sync-read alignment (Ch158)
|
||
|
||
Ch158 closes the last big blocker before swapping the board top
|
||
to `vram_bram_stub`: the PCRTC's data-decode + sync-output
|
||
pipeline is now aware that `vram_bram_stub`'s `read_data` is
|
||
registered with 1-cycle latency, so the captured scanout no
|
||
longer trails the address stage by one column.
|
||
|
||
**`gs_pcrtc_stub` change** (in
|
||
[`rtl/gif_gs/gs_pcrtc_stub.sv`](../../rtl/gif_gs/gs_pcrtc_stub.sv)):
|
||
new module parameter `VRAM_SYNC_READ` (default 0). When set to 1,
|
||
every hcnt/vcnt-derived signal that the data-decode comb consumes
|
||
is run through a 1-cycle register before the consumer sees it
|
||
(`active_h_dec`, `active_v_dec`, `in_hsync_dec`, `in_vsync_dec`,
|
||
`in_display_window_dec`, `scanout_enable_dec`, `dispfb_psm_*_dec`,
|
||
`psm4_nibble_select_dec`, `end_of_frame_dec`). The address-side
|
||
(`vram_read_addr`) keeps using the current `(hcnt, vcnt)` so the
|
||
read is issued one pixel "ahead"; the registered `vram_read_data`
|
||
arrives one cycle later, paired with the matching delayed counter
|
||
view. Outputs `r/g/b/hsync/vsync/de` come from the `_dec` signals,
|
||
so the entire output stream shifts right by exactly one clock
|
||
when `VRAM_SYNC_READ=1`. Default `VRAM_SYNC_READ=0` is a pure
|
||
passthrough — every existing PCRTC TB written against the legacy
|
||
`vram_stub` (comb-read) shape is unaffected.
|
||
|
||
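The alignment argument can be checked with a one-scanline Python model. This is a sketch under simplifying assumptions (one pixel per address, a single row, no blanking); `scanout`, `read_q`, and `hcnt_q` are illustrative names, with `delay_decode=True` standing in for `VRAM_SYNC_READ=1`.

```python
def scanout(row, sync_read, delay_decode):
    """Return {column: pixel} captured for one scanline.

    sync_read:    True models the registered (1-cycle latency) read port.
    delay_decode: True pairs the data with a 1-cycle-delayed counter copy.
    """
    captured = {}
    read_q = None  # registered read data
    hcnt_q = None  # delayed copy of the column counter
    for hcnt in range(len(row) + 1):  # one extra cycle flushes the last pixel
        in_range = hcnt < len(row)
        data_comb = row[hcnt] if in_range else None  # address uses the current counter
        data = read_q if sync_read else data_comb
        col = hcnt_q if delay_decode else (hcnt if in_range else None)
        if col is not None and data is not None:
            captured[col] = data
        # Clock edge: register the read data and the counter together.
        read_q = data_comb
        hcnt_q = hcnt if in_range else None
    return captured
```

With `row = [10, 20, 30, 40]`, a registered read decoded against the undelayed counter yields `{1: 10, 2: 20, 3: 30}`, the one-column shift from the Ch155 caveat. Delaying the decode-side counter by the same cycle restores `{0: 10, 1: 20, 2: 30, 3: 40}`.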
**`top_psmct32_raster_demo_bram` change**: instantiates
|
||
`gs_pcrtc_stub` with `.VRAM_SYNC_READ(1'b1)`. The wrapper banner
|
||
updates to drop the Ch155 caveat about scanout being 1 column
|
||
shifted — that caveat is now resolved.
|
||
|
||
**`tb_top_psmct32_raster_demo_bram` extension**: adds a Phase 2
|
||
frame-capture block that arms on the next vsync rising edge
|
||
after raster drain, captures one full frame's r/g/b into
|
||
`cap_*[v][h]` indexed by a 1-cycle-delayed copy of PCRTC's
|
||
address-stage counters (since the registered `de` aligns with
|
||
those delayed counters), and asserts each captured pixel's
|
||
post-decode r/g/b matches the expected ABGR for its quadrant.
|
||
Phase 1 (per-pixel VRAM probe via hierarchical `mem[byte_addr >> 2]`)
|
||
is unchanged. **PASS** — 16×8 active region, all 128 pixels
|
||
captured + all 128 VRAM words probe-verified, `frame_seen`
|
||
latched.
|
||
|
||
**Open Ch159+ items**:
|
||
- xfer-side T4 coverage TB — the Ch157 wrapper handles xfer-side
|
||
T4 emits identically (the mux feeds `vram_psm_pre` from
|
||
`bitbltbuf_q[61:56]` during `xfer_busy`), but no focused TB
|
||
exercises that path yet.
|
||
- Swap the Ch146 board top to instantiate `vram_bram_stub` and
|
||
the Ch158 PCRTC-sync mode directly (or retire `vram_stub`
|
||
outright). All four writer PSMs and PCRTC scanout are now
|
||
proven correct against the BRAM-friendly contract; the
|
||
remaining work is the integration commit on the board side.
|
||
- Audit `useg_shadow_mem` for the same BRAM-shape forensics that
|
||
Ch153 ran on `vram_stub` (Ch64/Ch65/Ch70 mirror writes may
|
||
make it multi-port-write-shaped).
|
||
|
||
**Ch158 audit Medium fix — sub-word PSM lane selection**: the
|
||
initial Ch158 cut shifted the data-decode pipeline by 1 cycle
|
||
to align with `vram_bram_stub`'s registered output, but it
|
||
still extracted CT16 / PSMT8 / PSMT4 sub-word values from the
|
||
LOW lane of `vram_read_data` (i.e. `[15:0]` halfword and
|
||
`[7:0]` byte). That worked for `vram_stub` (byte-addressable;
|
||
the read returns 4 bytes starting at `byte_addr` so the
|
||
sub-word always lands at the low lane) but NOT for
|
||
`vram_bram_stub` (word-addressable; `read_data` is
|
||
`mem[byte_addr >> 2]` so the sub-word lives at lane
|
||
`byte_addr[1:0]` of the returned word). Codex Ch158 audit
|
||
called this out as a blocker for any sub-word PSM scanout
|
||
through the BRAM. The fix adds:
|
||
|
||
- `vram_addr_lane_q` — 1-cycle-delayed copy of
|
||
`vram_read_addr[1:0]`, paralleling the other `_q` decode-
|
||
stage registers added in the original Ch158 cut.
|
||
- `data_lane = VRAM_SYNC_READ ? vram_addr_lane_dec : 2'd0` —
|
||
forces the legacy comb-read path to keep using the low lane
|
||
(preserving every existing PCRTC TB's expectation), and
|
||
resolves to the correct byte_addr-keyed lane in sync mode.
|
||
- `psm16_pixel = data_lane[1] ? read_data[31:16] : read_data[15:0]`.
|
||
- A `vram_byte_lane` mux extracting one of 4 byte lanes for
|
||
PSMT8 (`psm8_idx`) and PSMT4 (`psm4_byte_lane` → nibble
|
||
splice).
|
||
|
||
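The lane-selection rules in the list above can be expressed as a short Python model. It is a hedged sketch rather than the RTL: `select_subword` and `psm4_index` are hypothetical helper names, and the nibble-select polarity (1 selects the high nibble) is an assumption.

```python
def select_subword(read_data, byte_addr, vram_sync_read):
    """Return (psm16_pixel, byte_lane_value) extracted from a 32-bit read word."""
    # data_lane = VRAM_SYNC_READ ? byte_addr[1:0] : 2'd0
    lane = (byte_addr & 0x3) if vram_sync_read else 0
    if lane & 0x2:
        psm16_pixel = (read_data >> 16) & 0xFFFF
    else:
        psm16_pixel = read_data & 0xFFFF
    byte_lane = (read_data >> (8 * lane)) & 0xFF
    return psm16_pixel, byte_lane


def psm4_index(byte_lane, nibble_select):
    """Pick one nibble of the selected byte (assumed: 1 = high nibble)."""
    return (byte_lane >> 4) & 0xF if nibble_select else byte_lane & 0xF
```

For the word `0xDDCCBBAA`, sync mode returns bytes `0xAA, 0xBB, 0xCC, 0xDD` for byte addresses 0 to 3, while the legacy setting always returns `0xAA`, which is the failure the audit described.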
Two new focused integration TBs prove the fix end-to-end with
|
||
adversarial pre-loads:
|
||
|
||
| TB | Coverage |
|
||
| --------------------------------------------------------------------------------- | -------------------------------------------------------------------------------------------------------------- |
|
||
| [`tb_gs_scanout_bram_psmct16`](../../sim/tb/gif_gs/tb_gs_scanout_bram_psmct16.sv) | 4-pixel CT16 scanout reading mem[0]/mem[1] with FOUR distinct halfwords across both halfword lanes (`byte_addr[1]∈{0,1}`); each pixel's captured 5→8-decoded RGB matches the expected halfword. **PASS** |
|
||
| [`tb_gs_scanout_bram_psmt8`](../../sim/tb/gif_gs/tb_gs_scanout_bram_psmt8.sv) | 4-pixel PSMT8 scanout reading mem[0] with FOUR distinct byte indices, one per byte lane (`byte_addr[1:0] ∈ {0,1,2,3}`); each pixel's grayscale RGB matches the expected byte. **PASS** |
|
||
|
||
Without the fix, both TBs would have failed: the CT16 TB would
emit each word's low halfword twice (the high-lane pixels would
repeat their low-lane neighbours), and the PSMT8 TB would emit
`IDX_0` for all four pixels.
|
||
|
||
**Sim regression**: 143 PASS / 0 FAIL after Ch158 audit fixes
|
||
(141 + 2 new BRAM scanout TBs).
|
||
|
||
### Board-top swap to BRAM wrapper + Quartus fit recovery (Ch159)
|
||
|
||
Ch159 commits the integration step that the prior chapters
|
||
were building toward: the DE25-Nano board top
|
||
([`rtl/top/de25_nano_psmct32_raster_demo_top.sv`](../../rtl/top/de25_nano_psmct32_raster_demo_top.sv))
|
||
now instantiates [`top_psmct32_raster_demo_bram`](../../rtl/top/top_psmct32_raster_demo_bram.sv)
|
||
instead of the legacy [`top_psmct32_raster_demo`](../../rtl/top/top_psmct32_raster_demo.sv).
|
||
External port shape is identical so this is drop-in at the
|
||
board level; the BRAM-backed wrapper carries through every
|
||
Ch155-Ch158 fix (writer-side normalize + PSMT4 RMW pipe +
|
||
PCRTC sync-read alignment + sub-word lane select). The synth
|
||
file list ([`synth/de25_nano/top_psmct32_raster_demo/files.f`](../../synth/de25_nano/top_psmct32_raster_demo/files.f))
|
||
and Quartus QSF gain `vram_normalize_pkg.sv`, `vram_bram_stub.sv`,
|
||
and `top_psmct32_raster_demo_bram.sv`; the legacy `vram_stub`
|
||
+ legacy top stay on the project for back-compat with sim TBs
|
||
that still target them.
|
||
|
||
**Quartus fit recovery — vs Ch152 baseline**: the headline of
|
||
this chapter. Ch152 fit FAILED at 155k ALMs needed (331% of the 46,800 available)
|
||
because `vram_stub`'s 8 KiB byte-addressable + per-bit-RMW
|
||
storage didn't infer as M20K and landed as a 65,536-flip-flop
|
||
array, dragging 121k registers and 199k synthesis ALMs along
|
||
with it. Ch159 swap turns those numbers around:
|
||
|
||
| Metric | Ch152 (vram_stub) | Ch159 (vram_bram_stub) | Δ |
| --- | --- | --- | --- |
| Synthesis status | Successful | Successful | no change |
| Synthesis ALMs estimate | 199,103 / 46,800 (425% of capacity) | **22,704 / 46,800 (49%)** | −176,399 (**−88.6%**) |
| Synthesis registers | 101,457 | **36,008** | −65,449 (**−64.5%**) |
| **Fit status** | **FAILED** (155k / 331% of capacity) | **Successful** (30,364 / 65%) | ✅ **fits** |
| Fit registers | 121,176 | **39,085** | −82,091 (**−67.7%**) |
| Fit RAM blocks | 6 / 358 | **14 / 358** | +8 (BRAM-inferred VRAM) |
| Fit block memory bits | 65,536 | **196,608** | +131,072 (data in M20K) |
| Fit DSP blocks | 20 | 18 | −2 |
| **STA status** | **DID NOT RUN** (fit failed) | **Successful** (12 warnings) | ✅ STA reachable |
| STA setup slack worst (CLOCK2_50) | n/a | −6.950 ns | timing miss at 50 MHz |
| Fmax | n/a | 37.11 MHz | (Ch160+ tunes) |
|
||
|
||
The eight new RAM blocks are the same `vram_bram_stub`
|
||
footprint exp_c proved in Ch154 (8 RAM blocks for the dual-port
|
||
+ admission-gated 8 KiB shape; the +6 already in the Ch152
|
||
baseline came from `bios_rom_stub` + `ee_ram_stub` +
|
||
`useg_shadow_mem` correctly inferring as BRAM there). The
|
||
register drop (121k → 39k) is essentially the entire VRAM
|
||
flip-flop array vanishing.
|
||
|
||
**Setup-slack reality check**: STA reports −6.950 ns slack at
|
||
the CLOCK2_50 50 MHz constraint (Fmax = 37.11 MHz). The
|
||
critical path is somewhere in the Ch123 dep tree's longer
|
||
combinational chains (likely the Gouraud divider or one of
|
||
the swizzle muxers). That is **NOT** a Ch159 regression: the
path was always there and is only visible now because STA can
finally run. Ch160+ owns timing closure (PLL down-clock to ≤30 MHz,
critical-path pipelining, or both).
|
||
|
||
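The relationship between the reported slack and Fmax is simple arithmetic and can be checked against the tables in these chapters. The sketch below uses the single-clock approximation `Fmax = 1 / (period - slack)`; it ignores clock skew and uncertainty, so treat it as a sanity check rather than a replacement for the STA report. `fmax_mhz` is an illustrative helper name.

```python
def fmax_mhz(period_ns, setup_slack_ns):
    """Approximate Fmax from the constrained period and worst setup slack."""
    critical_path_ns = period_ns - setup_slack_ns  # negative slack lengthens the path
    return 1000.0 / critical_path_ns


# 50 MHz constraint (20.000 ns) with the Ch159 worst slack of -6.950 ns
ch159_fmax = round(fmax_mhz(20.000, -6.950), 2)
```

The same formula reproduces the later figures: a 33.333 ns period with +0.805 ns slack gives about 30.74 MHz, and with +3.567 ns slack about 33.6 MHz.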
**Snapshots preserved**: the Ch152 baseline reports are saved
|
||
under
|
||
[`synth/de25_nano/top_psmct32_raster_demo/baseline_ch152/`](../../synth/de25_nano/top_psmct32_raster_demo/baseline_ch152/)
|
||
(syn / fit summaries + flow.rpt + parse_report.txt) so future
|
||
chapters can diff against them without re-running the failing
|
||
Ch152 baseline.
|
||
|
||
**Sim regression**: 143 PASS / 0 FAIL unchanged. The Ch149
|
||
board-wrapper TB exercises the same external behavior with
|
||
the new core wrapper inside.
|
||
|
||
### Down-clock target + first .sof bitstream (Ch160)
|
||
|
||
Ch160 closes the loop Codex framed at the end of Ch159 — "first
|
||
add a down-clock PLL profile so we can get a real bitstream
|
||
moving on hardware, then use the successful STA path report to
|
||
decide whether to pipeline toward 50 MHz." The chapter is SDC-
|
||
and build-flow-only; no RTL changes.
|
||
|
||
**SDC retarget** ([`synth/de25_nano/top_psmct32_raster_demo/de25_nano_psmct32_raster_demo_top.sdc`](../../synth/de25_nano/top_psmct32_raster_demo/de25_nano_psmct32_raster_demo_top.sdc))
|
||
relaxes the CLOCK2_50 period from 20.000 ns (50 MHz) to
|
||
33.333 ns (30 MHz). The DE25-Nano's CLOCK2_50 oscillator is
|
||
physically still 50 MHz; the SDC tells Quartus to ASSUME a
|
||
30 MHz input so the fitter closes timing at the down-clock
|
||
target. A real PLL `.ip` that divides 50 → 30 MHz on hardware
|
||
is the Ch161+ commit (the QSF's commented-out `QIP_FILE`
|
||
swap-point is staged for it). Until then, the .sof produced
|
||
under this constraint is structurally clean for 30 MHz
|
||
operation; programming it on a board where CLOCK2_50 is still
|
||
wired straight through gives an effective 50 MHz chip clock
|
||
that may show setup-violating behavior — Ch161 closes that
|
||
gap.
|
||
|
||
**`build_quartus.sh` adds `quartus_asm`** ([`synth/de25_nano/top_psmct32_raster_demo/build_quartus.sh`](../../synth/de25_nano/top_psmct32_raster_demo/build_quartus.sh))
|
||
gated on a clean STA, so a `.sof` bitstream is now produced
|
||
when the design fits and timing closes. The Make scaffold
|
||
check is loosened to accept either the 50 MHz (legacy) or
|
||
33.333 ns (Ch160 down-clock) period.
|
||
|
||
**Quartus result vs Ch159**:
|
||
|
||
| Metric | Ch159 (50 MHz target) | Ch160 (30 MHz target) |
|
||
|-------------------------------|-------------------------------|-------------------------------|
|
||
| Synth ALMs estimate | 22,704 / 46,800 (49 %) | 22,704 / 46,800 (49 %) |
|
||
| Synth registers | 36,008 | 36,008 |
|
||
| Fit status | Successful | Successful |
|
||
| Fit ALMs | 30,364 / 46,800 (65 %) | 31,056 / 46,800 (66 %) |
|
||
| Fit registers | 39,085 | 37,381 |
|
||
| Fit RAM blocks | 14 / 358 | 14 / 358 |
|
||
| **STA setup slack worst** | **−6.950 ns** (timing miss) | **+0.805 ns** (closes) |
|
||
| **Fmax (CLOCK2_50)** | 37.11 MHz | 30.74 MHz |
|
||
| **`quartus_asm`** | (skipped) | **Successful — `.sof` produced** |
|
||
|
||
The synth-side numbers are identical because no RTL changed; the
differences come entirely from the fitter's placement choices
under the looser timing constraint. Reported Fmax dropped from
37.11 MHz to 30.74 MHz (about 17 %) because Quartus optimizes
harder when the target is tighter, so the lower figure reflects
the relaxed target rather than a slower design. The headline is
that **at the 30 MHz target the design CLOSES** (positive slack
on every report) and a real `.sof` is now generated.
|
||
|
||
**Critical path** (from
|
||
[`output_files/de25_nano_psmct32_raster_demo_top.sta.rpt`](../../synth/de25_nano/top_psmct32_raster_demo/output_files/de25_nano_psmct32_raster_demo_top.sta.rpt),
|
||
worst-10 paths all in the same module hierarchy):
|
||
|
||
| Field | Value |
|
||
|--------------|------------------------------------------------------------------------------------------|
|
||
| Slack | +0.805 ns (worst path of 10 with this slack value) |
|
||
| From / To | `u_demo|u_core|div_0_rtl_0|auto_generated|divider|divider|...` (intra-divider register-to-register) |
|
||
| Data Delay | 32.516 ns (out of 33.333 ns period) |
|
||
| Critical net | The EE core's auto-generated 64-bit signed divider (the Ch152-noted Gouraud TRI divider — dead code in the PSMCT32 raster demo because no `RM_TRI` primitive is dispatched). |
|
||
|
||
**Ch161+ pipelining handoff**: the path Codex's framing asked
|
||
us to surface is now visible. Two options:
|
||
|
||
1. **Pipeline the divider** — re-implement `ee_core`'s 64-bit
|
||
division as an N-cycle multi-cycle path. Quartus's auto-
|
||
generated divider is a single-cycle ripple chain; making it
|
||
2-3 stage pipelined would put Fmax comfortably above 50 MHz.
|
||
2. **Strip it from the build** — gate the Gouraud TRI
|
||
divider behind a `STRIP_GOURAUD_TRI` parameter (default
|
||
off), so the PSMCT32 raster demo's hardware build instances
|
||
the EE core without it. Quartus removes the entire
|
||
`div_0_rtl_0` block; Fmax should jump dramatically.
|
||
|
||
Option 2 is the lower-blast-radius hardware bring-up move
|
||
(removes ~32 ns of dead-code combinational chain); option 1
|
||
is the long-term correct fix once the Gouraud TRI path goes
|
||
load-bearing.
|
||
|
||
**Snapshots**: Ch159 baseline reports preserved under
|
||
[`baseline_ch159/`](../../synth/de25_nano/top_psmct32_raster_demo/baseline_ch159/)
|
||
(syn / fit / sta summaries + parse_report).
|
||
|
||
**Sim regression**: 143 PASS / 0 FAIL unchanged (no RTL
|
||
changes). Scaffold check + Ch149 board TB + top BRAM TB all
|
||
green under the new SDC.
|
||
|
||
### Real PLL IP commit — `.sof` actually runs at 30 MHz (Ch161)
|
||
|
||
Ch161 retires the Ch160 hardware-honesty caveat by committing a
|
||
real Quartus IOPLL `.ip` configured for 50 MHz refclk → 30 MHz
|
||
outclk_0. The wrapper's `\`ifdef USE_PLL_IP` (staged in Ch151)
|
||
now flips to the IP-generated `pll` module on Quartus builds;
|
||
sim TBs continue to use the pass-through `de25_nano_pll_stub`.
|
||
|
||
**Files committed under
|
||
[`synth/de25_nano/top_psmct32_raster_demo/ip/`](../../synth/de25_nano/top_psmct32_raster_demo/ip/)**:
|
||
|
||
- `pll.ip` — adapted from `retroDE_nes/ip/audio_pll.ip` (single-
|
||
output Agilex 5 IOPLL template), retargeted to 50 MHz refclk
|
||
→ 30 MHz outclk_0.
|
||
- `pll/pll.qip` + `pll/synth/pll.v` + `pll/pll_bb.v` — Quartus
|
||
IP-generated artifacts (`quartus_ipgenerate de25_nano_psmct32_raster_demo_top --ip_file=ip/pll.ip --generate_ip_file --synthesis=verilog`).
|
||
The generated `pll` module exposes
|
||
`(refclk, rst, outclk_0, locked)` — exactly the Ch151 stub's
|
||
signature, so the `\`ifdef` swap is drop-in.
|
||
|
||
**Wiring changes**:
|
||
|
||
- `de25_nano_psmct32_raster_demo_top.qsf` uncommented the
|
||
`set_global_assignment -name QIP_FILE ip/pll/pll.qip` swap-
|
||
point and added
|
||
`set_global_assignment -name VERILOG_MACRO "USE_PLL_IP=1"` so
|
||
Quartus instantiates the IP `pll` instead of the
|
||
`de25_nano_pll_stub`.
|
||
- `de25_nano_psmct32_raster_demo_top.sdc` reverted the Ch160
|
||
CLOCK2_50 period back to 20.000 ns (the physical 50 MHz
|
||
oscillator). The IOPLL's auto-generated SDC inside the .qip
|
||
declares the post-PLL `outclk_0` clock at 30 MHz, so STA
|
||
picks up two domains: `u_pll|iopll_0_refclk` (50 MHz, the
|
||
pin) and `u_pll|iopll_0_outclk0` (30 MHz, the design clock).
|
||
- `build_quartus.sh` symlinks the `ip/` dir alongside the
|
||
existing `rtl/` and `sim/` symlinks so the QIP_FILE's
|
||
`ip/pll/pll.qip` path resolves from the work dir.
|
||
|
||
**Quartus result vs Ch160**:
|
||
|
||
| Metric | Ch160 (SDC profile only) | Ch161 (real PLL IP) |
|
||
|-----------------------------------|------------------------------|------------------------------|
|
||
| Fit ALMs | 31,056 / 46,800 (66 %) | 30,898 / 46,800 (66 %) |
|
||
| Fit registers | 37,381 | 37,352 |
|
||
| **Fit PLLs** | **0 / 11** | **1 / 11** (real IOPLL) |
|
||
| RAM blocks | 14 / 358 | 14 / 358 |
|
||
| Setup slack worst (design_clk) | +0.805 ns @ CLOCK2_50 | **+0.565 ns @ u_pll|iopll_0_outclk0** |
|
||
| Fmax (design_clk) | 30.74 MHz | **30.74 MHz** |
|
||
| `quartus_asm` | Successful | Successful (`.sof` produced) |
|
||
|
||
The `+1` PLL block is the real IOPLL on the chip; ALMs go down
|
||
slightly because the stub's clock-distribution path no longer
|
||
needs ALM glue. STA now reports BOTH clock domains: the refclk
|
||
(50 MHz, +19.249 ns slack — trivially fast) and the design_clk
|
||
(30 MHz post-PLL, +0.565 ns slack, a thin but positive margin). The
|
||
`.sof` produced under this configuration **genuinely runs at
|
||
30 MHz on the DE25-Nano**: the IOPLL takes the 50 MHz CLOCK2_50
|
||
input and divides to 30 MHz inside the chip, so the entire
|
||
design downstream of `u_pll.outclk_0` operates at the
|
||
constrained frequency. (Setup slack landed at +0.914 ns on the
|
||
initial Ch161 build; the Ch161 audit's wider reset false-path
|
||
nudged the fitter into a slightly different placement, dropping
|
||
the worst-case setup slack to +0.565 ns. Recovery analysis on
|
||
the rst_sync stages — which had been hiding a real -0.079 ns
|
||
violation under the original `*rst_sync[0]` constraint — is now
|
||
gone from the .sta.summary entirely after the false-path was
|
||
widened to `*rst_sync[*]`.)
|
||
|
||
**Snapshots**: Ch160 baseline (parse_report + summaries +
|
||
`.sof`) preserved under
|
||
[`baseline_ch160/`](../../synth/de25_nano/top_psmct32_raster_demo/baseline_ch160/).
|
||
|
||
**Open Ch162+ items** (Ch161 forward-ref, **superseded by
|
||
Ch162 below**):
|
||
|
||
- ~~Pipeline or strip the EE-core 64-bit Gouraud TRI divider~~ —
|
||
**closed in Ch162** via `STRIP_HW_DIVIDER` (note: the actual
|
||
divider is the Ch43 DIVU divider, not Gouraud TRI; the
|
||
forward-ref's name was loose). The Ch162 strip retired the
|
||
`u_demo|u_core|div_0_rtl_0|...` STA worst path entirely; see
|
||
the Ch162 section below for the new critical path.
|
||
- xfer-side T4 coverage TB (open from Ch157+).
|
||
- `useg_shadow_mem` BRAM-shape forensics.
|
||
- Video PHY shim (HDMI / VGA / PMOD) — `VIDEO_*` pins
|
||
virtualized.
|
||
|
||
**Sim regression**: 143 PASS / 0 FAIL unchanged. Sim ignores
|
||
the `\`ifdef USE_PLL_IP` (no `+define+USE_PLL_IP` in the
|
||
iverilog Makefile) so the stub stays active under sim.
|
||
|
||
### Strip the EE-core hardware divider (Ch162)
|
||
|
||
Ch162 takes the lower-blast move from the Ch161 STA handoff:
|
||
add a parameter that gates the EE-core's auto-inferred 32-bit
|
||
hardware divider out of synthesis on the PSMCT32 SPRITE-only
|
||
hardware build, then re-measure Fmax.
|
||
|
||
**RTL change** ([rtl/ee/ee_core_stub.sv](../../rtl/ee/ee_core_stub.sv))
|
||
gains `parameter bit STRIP_HW_DIVIDER = 1'b0`. Two `/` and `%`
|
||
sites tied to the Ch43 DIVU instruction are gated by this
|
||
parameter — the writeback (lines ~932-935) and the retire-
|
||
trace `arg3` mirror (lines ~1005-1014). Default `0` keeps
|
||
DIVU semantics intact for every existing sim TB
|
||
(`tb_ee_core_divu_mflo` is the only consumer; it stays at the
|
||
default). When the parameter is `1`, the writeback becomes a
|
||
no-op (HI/LO unchanged, identical to the divisor==0 case the
|
||
spec calls undefined) and the retire-trace `arg3` reports 0.
|
||
Quartus then has nothing to infer — the `div_0_rtl_0` block
|
||
disappears.
|
||
|
||
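The gated writeback can be summarised in a few lines of Python. This is a behavioural sketch, not the stub's RTL: `divu_writeback` is a hypothetical name, operands are treated as 32-bit unsigned values, and any sign extension of HI/LO is deliberately not modelled.

```python
MASK32 = 0xFFFFFFFF


def divu_writeback(hi, lo, rs, rt, strip_hw_divider=False):
    """Return the (HI, LO) pair after a DIVU retires."""
    rs &= MASK32
    rt &= MASK32
    if strip_hw_divider or rt == 0:
        # No-op writeback: HI/LO keep their previous values.
        return hi, lo
    # MIPS DIVU: LO receives the quotient, HI the remainder.
    return rs % rt, rs // rt
```

With the parameter at its default, `divu_writeback(1, 2, 17, 5)` yields `(2, 3)`. With the strip enabled the same call returns `(1, 2)`, matching the divide-by-zero behaviour described above.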
**Wrapper plumbing**:
|
||
[`top_psmct32_raster_demo_bram`](../../rtl/top/top_psmct32_raster_demo_bram.sv)
|
||
gains a matching `STRIP_HW_DIVIDER` parameter and forwards it
|
||
to `ee_core_stub`. The
|
||
[DE25-Nano board top](../../rtl/top/de25_nano_psmct32_raster_demo_top.sv)
|
||
sets `.STRIP_HW_DIVIDER(1'b1)` on its `u_demo` instantiation
|
||
(the bootlet doesn't execute DIVU, so this is behavior-neutral
|
||
for the demo). Sim TBs that instantiate the BRAM wrapper
|
||
directly use the default 0.
|
||
|
||
**Quartus result vs Ch161 (real-PLL baseline)**:
|
||
|
||
| Metric | Ch161 (real PLL) | Ch162 (real PLL + strip) |
|
||
|-----------------------------------|-------------------------------|-------------------------------|
|
||
| Fit ALMs | 30,898 / 46,800 (66 %) | 30,006 / 46,800 (64 %) |
|
||
| Fit registers | 37,352 | 36,618 |
|
||
| Fit PLLs | 1 | 1 |
|
||
| RAM blocks | 14 | 14 |
|
||
| **Setup slack worst (design)** | +0.565 ns | **+3.567 ns** |
|
||
| **Fmax (design domain)** | 30.74 MHz | **33.6 MHz** (+9.4 %) |
|
||
| `quartus_asm` | Successful | Successful (`.sof` produced) |
|
||
|
||
Stripping the divider freed 892 ALMs / 734 registers and
yielded ~3 ns of new setup margin. **Fmax climbs from 30.74
MHz to 33.6 MHz**, a real jump, but **not enough to clear the
50 MHz target** (which would need a further ~49 % on top of 33.6 MHz). Codex's
|
||
Ch162 framing predicted this branch: "if Fmax jumps, we have a
|
||
clean path to a 50 MHz demo bitstream; if not, the next real
|
||
critical path will reveal itself." We landed in the second
|
||
branch — Fmax jumped, but not far enough.
|
||
|
||
**New critical path** (the Ch163+ handoff, from
|
||
[`output_files/de25_nano_psmct32_raster_demo_top.sta.rpt`](../../synth/de25_nano/top_psmct32_raster_demo/output_files/de25_nano_psmct32_raster_demo_top.sta.rpt)):
|
||
|
||
| Field | Value |
|
||
|------------|-------------------------------------------------------------------------------------------------------------------|
|
||
| Slack | +3.567 ns |
|
||
| From | `u_demo|u_pcrtc|div_1_rtl_0|auto_generated|divider|divider|...` (PCRTC magnification divider) |
|
||
| To | `u_demo|u_vram|mem_rtl_0|auto_generated|altera_syncram_impl1|ram_block2a15~reg0` (VRAM port input) |
|
||
| Data delay | 38.443 ns of arrival vs 42.010 ns required (period 33.333 ns + clock skew + uncertainty) |
|
||
|
||
The PCRTC divider comes from
|
||
[`gs_pcrtc_stub.sv`](../../rtl/gif_gs/gs_pcrtc_stub.sv) lines:
|
||
```
|
||
assign vram_x_unshift = {20'd0, hwin_rel} / hmag_factor;
|
||
assign vram_y_unshift = {20'd0, vwin_rel} / vmag_factor;
|
||
```
|
||
where `hmag_factor = MAGH + 1` and `vmag_factor = MAGV + 1`.
|
||
For the demo `MAGH = MAGV = 0`, so the divisor is constant 1
|
||
— but Quartus doesn't constant-propagate through this
|
||
formulation and synthesizes a real 32-bit divider anyway. The
|
||
parallel Ch162 fix shape would be a `STRIP_PCRTC_MAG_DIV`
|
||
parameter (or a more general "demo doesn't use magnification"
|
||
hint that bypasses the divider when both MAGH and MAGV are
|
||
constant 0).
|
||
|
||
**Snapshots**: Ch161 baseline preserved under
|
||
[`baseline_ch161/`](../../synth/de25_nano/top_psmct32_raster_demo/baseline_ch161/)
|
||
(syn / fit / sta summaries + parse_report + .sof) for diff.
|
||
|
||
**Open Ch163+ items**:
|
||
- Strip the PCRTC magnification divider on hardware builds
|
||
(next critical path; same shape as Ch162's
|
||
`STRIP_HW_DIVIDER`).
|
||
- Once Fmax climbs north of 50 MHz, retune the IOPLL `.ip` to
|
||
outclk_0 = 50 MHz, retarget the SDC, and ship a 50 MHz
|
||
bitstream.
|
||
- xfer-side T4 coverage TB (still open from Ch157+).
|
||
- `useg_shadow_mem` BRAM-shape forensics.
|
||
- Video PHY shim (HDMI / VGA / PMOD) — `VIDEO_*` pins
|
||
virtualized.
|
||
|
||
**Sim regression**: 143 PASS / 0 FAIL unchanged. Default
|
||
`STRIP_HW_DIVIDER=0` preserves DIVU semantics for
|
||
`tb_ee_core_divu_mflo`; the board top's `STRIP_HW_DIVIDER=1`
|
||
goes through `tb_de25_nano_psmct32_raster_demo_top` cleanly
|
||
because the Ch149 board TB doesn't exercise DIVU.
|
||
|
||
### Strip PCRTC magnification divider + 50 MHz close (Ch163)
|
||
|
||
Ch163 takes the next critical-path attack from the Ch162 STA
|
||
report (the PCRTC magnification divider) and uses the resulting
|
||
Fmax headroom to retune the PLL IP to 50 MHz output — closing
|
||
the journey that started at the Ch152 fit failure with a real
|
||
50 MHz bitstream.
|
||
|
||
**RTL change** ([rtl/gif_gs/gs_pcrtc_stub.sv](../../rtl/gif_gs/gs_pcrtc_stub.sv))
|
||
gains `parameter bit STRIP_PCRTC_MAG_DIV = 1'b0`. The two `/`
|
||
operators are gated:
|
||
```
|
||
assign vram_x_unshift = STRIP_PCRTC_MAG_DIV
|
||
? {20'd0, hwin_rel}
|
||
: ({20'd0, hwin_rel} / hmag_factor);
|
||
assign vram_y_unshift = STRIP_PCRTC_MAG_DIV
|
||
? {20'd0, vwin_rel}
|
||
: ({20'd0, vwin_rel} / vmag_factor);
|
||
```
|
||
Default `0` keeps the live divider math for every Ch93-era
|
||
magnification scanout TB (`tb_gs_scanout_magh_magv` etc.). When
|
||
`1`, the math collapses to a passthrough — equivalent to the
|
||
MAGH=MAGV=0 case the demo always hits but with no inferred
|
||
divider for Quartus to synthesize.
|
||
|
||
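The claim that the stripped path is equivalent only when magnification is off can be checked with a tiny Python model. `vram_unshift` is a hypothetical helper mirroring the gated `assign`, with `mag` standing for MAGH or MAGV; the 12-bit range for the window-relative coordinate follows from the `{20'd0, ...}` zero extension to 32 bits.

```python
def vram_unshift(win_rel, mag, strip_pcrtc_mag_div=False):
    """Window-relative coordinate -> VRAM coordinate."""
    if strip_pcrtc_mag_div:
        return win_rel            # passthrough, no divider inferred
    return win_rel // (mag + 1)   # live divider: factor = MAG + 1


# Exhaustive check over the 12-bit coordinate range with magnification off.
equivalent_at_mag0 = all(
    vram_unshift(r, 0, True) == vram_unshift(r, 0, False) for r in range(4096)
)
```

With `mag = 1` the two settings diverge (`10 // 2 = 5` versus a passthrough `10`), which is why the parameter has to stay at its default for the magnification TBs.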
**Wrapper plumbing**:
|
||
[`top_psmct32_raster_demo_bram`](../../rtl/top/top_psmct32_raster_demo_bram.sv)
|
||
gains a matching `STRIP_PCRTC_MAG_DIV` parameter that forwards
|
||
to `gs_pcrtc_stub`. The
|
||
[DE25-Nano board top](../../rtl/top/de25_nano_psmct32_raster_demo_top.sv)
|
||
sets `.STRIP_PCRTC_MAG_DIV(1'b1)` on its `u_demo` instantiation.
|
||
|
||
**Quartus result, two stages**:
|
||
|
||
*Stage A — strip @ 30 MHz target (still on the Ch161 PLL .ip)*:
|
||
|
||
| Metric | Ch162 (strip EE divider only) | Ch163 (strip both, 30 MHz) |
| --- | --- | --- |
| Fit ALMs | 30,006 / 46,800 (64 %) | 27,216 / 46,800 (58 %) |
| Setup slack worst | +3.567 ns | +21.113 ns |
| **Fmax (design)** | 33.6 MHz | **81.83 MHz** (+143 %) |
|
||
|
||
The Ch163 strip alone freed +17.5 ns of margin and 2,790 ALMs
|
||
— large enough to clear 50 MHz outright. Codex's Ch162 framing
|
||
predicted both branches of the if-Fmax-jumps fork; Ch163 lands
|
||
in the **first** branch ("clean path to a 50 MHz demo
|
||
bitstream").
|
||
|
||
*Stage B — retune PLL .ip from 30 MHz → 50 MHz output*:
|
||
|
||
The `pll.ip` source's `gui_output_clock_frequency0` and
|
||
`gui_output_clock_frequency_ps0` are bumped (30.0 → 50.0 MHz;
|
||
33333.333 → 20000.0 ps). `quartus_ipgenerate` rebuilds the
|
||
.qip / synth files in-place. No SDC change needed — CLOCK2_50
|
||
stays pinned at the physical 50 MHz period; the IOPLL's auto-
|
||
generated SDC declares the new outclk_0 frequency.
|
||
|
||
| Metric | Ch163 strip @ 30 MHz target | Ch163 strip @ 50 MHz target |
|
||
|-----------------------|-----------------------------|------------------------------|
|
||
| Fit ALMs | 27,216 / 46,800 (58 %) | 27,543 / 46,800 (59 %) |
|
||
| RAM blocks / PLLs | 14 / 1 | 14 / 1 |
|
||
| **Setup slack worst** | +21.113 ns | **+7.500 ns** |
|
||
| **Fmax (design)** | 81.83 MHz | **80.0 MHz** |
|
||
| `.sof` produced | yes (30 MHz run on hw) | **yes — 50 MHz on hw** |
|
||
|
||
**The `.sof` produced under Stage B genuinely runs at 50 MHz on
the DE25-Nano**: the IOPLL takes 50 MHz CLOCK2_50 in and emits
50 MHz `outclk_0`. The ratio is 1:1, but the design clock is
still generated and distributed by the real IOPLL rather than
taken straight from the pin. All 8 timing classes are positive,
there are no recovery violations, and the build gate reports
Successful.
|
||
|
||
**Snapshots**:
- [`baseline_ch162/`](../../synth/de25_nano/top_psmct32_raster_demo/baseline_ch162/)
  — Ch162 30 MHz state with EE divider stripped only.
- [`baseline_ch163_30mhz/`](../../synth/de25_nano/top_psmct32_raster_demo/baseline_ch163_30mhz/)
  — Ch163 strip-both at 30 MHz target (Stage A milestone).

**Open Ch164+ items** (the project has hit the major hardware
milestone Codex called out at Ch157+; Ch164+ is post-launch):
- xfer-side T4 coverage TB (open from Ch157+).
- `useg_shadow_mem` BRAM-shape forensics.
- Video PHY shim (HDMI / VGA / PMOD) — `VIDEO_*` pins still
  virtualized; this is the next big front-end deliverable
  before the demo can paint a real screen.

**Sim regression**: 143 PASS / 0 FAIL unchanged. Leaving
`STRIP_PCRTC_MAG_DIV` default-off preserves every Ch93 magnification
scanout TB; the board top's `STRIP_PCRTC_MAG_DIV=1` propagates
cleanly through `tb_de25_nano_psmct32_raster_demo_top` since
the demo locks MAGH=MAGV=0.

### HDMI pin shim — pixels off-chip (Ch164)

Ch164 is the first video-PHY chapter — Codex's framing was "small
PHY shim chapter, not a full display-stack leap. Get pixels
off-chip before making them pretty." Replace the abstract
`VIDEO_R/G/B/HSYNC/VSYNC/DE` virtual pins with real DE25-Nano
HDMI transmitter signals; the ADV7513 chip itself stays asleep
(its I²C wake-up FSM is the Ch165+ chapter), so the bitstream
makes the FPGA pins toggle correctly but a real monitor stays
dark until Ch165 lands.

**Wrapper change** ([rtl/top/de25_nano_psmct32_raster_demo_top.sv](../../rtl/top/de25_nano_psmct32_raster_demo_top.sv)):
five new top-level outputs added — `HDMI_TX_CLK` (= `design_clk`,
the 50 MHz pixel clock), `HDMI_TX_D[23:0]` packing
`{VIDEO_R, VIDEO_G, VIDEO_B}` (R in MSBs, ADV7513 default 24-bit
RGB), and `HDMI_TX_HS / HDMI_TX_VS / HDMI_TX_DE` mirroring the
abstract VIDEO_* signals. The VIDEO_* ports are kept on the
wrapper as `VIRTUAL_PIN ON` (the Ch149 board TB references them
via hierarchical probe).

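The `{VIDEO_R, VIDEO_G, VIDEO_B}` concatenation can be modelled in a few lines to make the lane order explicit. This is a sketch only, assuming 8 bits per channel (implied by the 24-bit bus but not stated channel by channel above); `pack_hdmi_tx_d` and `unpack_hdmi_tx_d` are hypothetical helper names, not functions from the repo.

```python
def pack_hdmi_tx_d(r: int, g: int, b: int) -> int:
    """Model of HDMI_TX_D[23:0] = {VIDEO_R, VIDEO_G, VIDEO_B}, R in the MSBs.

    Assumes each channel is 8 bits wide.
    """
    for name, value in (("r", r), ("g", g), ("b", b)):
        if not 0 <= value <= 0xFF:
            raise ValueError(f"{name} out of 8-bit range: {value}")
    return (r << 16) | (g << 8) | b


def unpack_hdmi_tx_d(word: int) -> tuple:
    """Inverse of pack_hdmi_tx_d: split the 24-bit word back into (r, g, b)."""
    return ((word >> 16) & 0xFF, (word >> 8) & 0xFF, word & 0xFF)


word = pack_hdmi_tx_d(0x12, 0x34, 0x56)
print(hex(word))
print(unpack_hdmi_tx_d(word))
```

A pure red pixel therefore drives only `HDMI_TX_D[23:16]`, which is a convenient first check with a logic analyzer on the data lanes.
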
**QSF change** ([synth/.../de25_nano_psmct32_raster_demo_top.qsf](../../synth/de25_nano/top_psmct32_raster_demo/de25_nano_psmct32_raster_demo_top.qsf)):
HDMI pinout sourced from
[`retroDE_nes/retroDE_nes.qsf`](../../../retroDE_nes/retroDE_nes.qsf)
for the same DE25-Nano (Terasic Agilex 5) board — `HDMI_TX_CLK`
on `PIN_DJ24` with 1.1-V IO standard (matches the on-board level
shifter), data + sync pins on 3.3-V LVCMOS. The companion
ADV7513 control pins (`HDMI_I2C_SCL`, `HDMI_I2C_SDA`,
`HDMI_TX_INT`, `HDMI_MCLK`) are intentionally NOT pinned — the
chip stays in standby on power-up and ignores its 24-bit RGB
input until the I²C wake-up FSM lands in Ch165+.

**SDC change** ([synth/.../de25_nano_psmct32_raster_demo_top.sdc](../../synth/de25_nano/top_psmct32_raster_demo/de25_nano_psmct32_raster_demo_top.sdc)):
`set_false_path -to` for each HDMI output port. Proper
`set_output_delay` constraints with respect to a generated
`HDMI_TX_CLK` domain land alongside the Ch165+ wake-up FSM,
when the ADV7513's actual setup/hold window comes out of the
chip's datasheet pass.

**Scaffold-check extension** ([sim/Makefile](../../sim/Makefile)):
`top_psmct32_raster_demo_quartus_scaffold_check` now also
verifies `HDMI_TX_CLK + HDMI_TX_D[0..23] + HS/VS/DE` are
pin-assigned (sentinel set; not exhaustive) — fails the gate
if Quartus would auto-place them on arbitrary package pins.

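The idea behind the scaffold check can be sketched as a small script that scans QSF text for `set_location_assignment` lines and reports any sentinel port without one. This is an illustration under stated assumptions, not the Makefile's actual implementation: `unassigned_ports` is a hypothetical helper, and every pin name in the sample except `PIN_DJ24` is a placeholder.

```python
import re

# Sentinel set described above: clock, three sync signals, 24 data lanes.
SENTINEL_PORTS = ["HDMI_TX_CLK", "HDMI_TX_HS", "HDMI_TX_VS", "HDMI_TX_DE"] + [
    f"HDMI_TX_D[{i}]" for i in range(24)
]

# Matches: set_location_assignment PIN_<name> -to <port>   (port may be quoted)
_LOCATION_RE = re.compile(
    r'^\s*set_location_assignment\s+PIN_\w+\s+-to\s+"?([^"\s]+)"?'
)


def unassigned_ports(qsf_text, ports=SENTINEL_PORTS):
    """Return the ports that have no set_location_assignment in the QSF text."""
    assigned = set()
    for line in qsf_text.splitlines():
        match = _LOCATION_RE.match(line)
        if match:
            assigned.add(match.group(1))
    return [port for port in ports if port not in assigned]


# Sample QSF fragment: the clock pin is from the text, the rest are placeholders.
sample_qsf = "\n".join(
    ["set_location_assignment PIN_DJ24 -to HDMI_TX_CLK"]
    + [f"set_location_assignment PIN_X{i} -to HDMI_TX_D[{i}]" for i in range(24)]
)

missing = unassigned_ports(sample_qsf)
print(missing)  # the three sync ports are deliberately left out of the sample
```

A gate built this way would fail whenever the returned list is non-empty, which mirrors the intent of catching auto-placed pins before a full fit.
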
**Quartus result vs Ch163 (50 MHz)**:

| Metric                      | Ch163 (50 MHz, no HDMI pins)  | Ch164 (50 MHz + HDMI pins)     |
|-----------------------------|-------------------------------|--------------------------------|
| Fit ALMs                    | 27,543 / 46,800 (59 %)        | 27,271 / 46,800 (58 %)         |
| Fit RAM / PLL blocks        | 14 / 1                        | 14 / 1                         |
| **Fit pins**                | **17 / 351 (5 %)**            | **45 / 351 (13 %)** (+28 HDMI) |
| Setup slack worst (design)  | +7.500 ns                     | +7.536 ns                      |
| Fmax (design domain)        | 80.0 MHz                      | ~80 MHz (unchanged)            |
| `quartus_asm`               | Successful                    | Successful (`.sof` produced)   |

The +28 pins are exactly the new HDMI shim — 24 RGB lanes, 1
clock, 3 sync (HS / VS / DE). Setup slack stays at ~+7.5 ns
because the new pins are covered by `set_false_path`, so STA
doesn't time anything against them yet. ALMs ticked down slightly
as the fitter rebalanced under the wider pin map.

**Snapshot**: Ch163 50 MHz baseline preserved at
[`baseline_ch163_50mhz/`](../../synth/de25_nano/top_psmct32_raster_demo/baseline_ch163_50mhz/)
(syn / fit / sta summaries + parse_report + .sof). The
[`baseline_ch163_30mhz/`](../../synth/de25_nano/top_psmct32_raster_demo/baseline_ch163_30mhz/)
30-MHz milestone is also preserved.

**Open Ch165+ items**:
- **ADV7513 I²C wake-up FSM** — without this the HDMI port
  outputs nothing on a real monitor. Ch165 owns the chip
  bring-up: pin `HDMI_I2C_SCL` / `HDMI_I2C_SDA` /
  `HDMI_TX_INT` / `HDMI_MCLK`, drop in an I²C master that
  walks the canonical ADV7513 register-set (sourced from
  `retroDE_nes`'s working bring-up).
- Proper `set_output_delay` constraints once the ADV7513
  setup/hold window is documented (replacing Ch164's
  `false_path`).
- Make the rendered pattern bigger than Ch123's 16×8 SPRITE so
  there's something visible to admire on a real screen.
- xfer-side T4 coverage TB (still open from Ch157+).
- `useg_shadow_mem` BRAM-shape forensics.

**Sim regression**: 143 PASS / 0 FAIL unchanged — no RTL
changes that touched sim semantics; the new HDMI ports are
combinational mirrors of existing VIDEO_* signals, and
`tb_de25_nano_psmct32_raster_demo_top` references VIDEO_*
unchanged.

### Wake the ADV7513 — first .sof that drives a real HDMI monitor (Ch165)

Ch165 turns "FPGA pins toggling" into "monitor has a fighting
chance of showing the tiny frame" — Codex's framing for the
chapter. The ADV7513 chip stays in standby on power-up; an I²C
master needs to walk a canonical register-write sequence to
configure 24-bit RGB input + sync polarity + power-up + HPD
override before the chip will accept the FPGA's HDMI_TX_*
data and drive the connector.

**Modules ported** (Terasic DE-series reference design, free
use on Terasic hardware per the license that ships with the
DE25-Nano System CD; copyright retained):

- [`rtl/platform/I2C_Controller.v`](../../rtl/platform/I2C_Controller.v)
  — bit-bang I²C master with 23-step transaction layout (start /
  slave-addr / sub-addr / data / stop, ~50 µs per bit at the
  derived 20 kHz I²C clock).
- [`rtl/platform/I2C_HDMI_Config.v`](../../rtl/platform/I2C_HDMI_Config.v)
  — wake-up FSM that walks a 38-entry LUT of ADV7513 register
  writes (slave 0x72): power-up + HPD override + audio init +
  AVI InfoFrame for full-range RGB 444 + dither + clock-divide +
  HDMI mode select. Adapted from the
  `retroDE_splash/rtl/platform/` versions (same DE25-Nano
  board); LUT customizations (HPD override, AVI InfoFrame for
  full-range RGB) carry through.

**Wrapper changes** ([rtl/top/de25_nano_psmct32_raster_demo_top.sv](../../rtl/top/de25_nano_psmct32_raster_demo_top.sv)):

- Four new top-level ports: `inout HDMI_I2C_SCL`,
  `inout HDMI_I2C_SDA` (open-drain I²C bus), `input HDMI_TX_INT`
  (chip's HPD / monitor-sense interrupt, active-low), and
  `output HDMI_MCLK` (audio sample-rate reference, driven by
  CLOCK2_50 since the demo is video-only).
- `I2C_HDMI_Config u_hdmi_i2c` instantiated. Clocked on
  `CLOCK2_50` (NOT `design_clk` — the wake-up runs even before
  the PLL locks); reset on `~ninit_done` (raw async reset; the
  I²C bus stays held in a clean state until FPGA init
  completes). Output `READY` (= `hdmi_init_done`) goes high
  after the LUT walk; `HDMI_TX_INT` going low retriggers the
  walk so a late hot-plug after FPGA boot still wakes the chip.
- New status LED: `LED[3] = ~hdmi_init_done` (active-low; lit
  means the chip is configured). `LED[7:4]` are re-tied HIGH.

**QSF + files.f + sim Makefile**:
[QSF](../../synth/de25_nano/top_psmct32_raster_demo/de25_nano_psmct32_raster_demo_top.qsf)
gains pin assignments for the 4 new control pins (sourced from
`retroDE_nes`: `BT1` / `BW2` / `CF2` / `CF1`) plus IO standards
(3.3-V LVCMOS for everything). The two new platform Verilog
sources are added to the QSF source list, the synth
[files.f](../../synth/de25_nano/top_psmct32_raster_demo/files.f),
and the sim Makefile's `RTL_SRCS`. The
[scaffold-check](../../sim/Makefile)
extends to verify that all 4 control pins have both a pin
assignment and an IO standard, alongside the Ch164 24-pin HDMI
data set.

**SDC change**
([de25_nano_psmct32_raster_demo_top.sdc](../../synth/de25_nano/top_psmct32_raster_demo/de25_nano_psmct32_raster_demo_top.sdc)):
`set_false_path -to / -from` on the new control pins. The I²C
bus runs at ~20 kHz (50 µs per SCL period) and is inherently
async to the design clock; HDMI_MCLK is driven by CLOCK2_50 and
sampled by the chip's audio PLL — both well below any
constraint on the fabric.

**Quartus result vs Ch164**:

| Metric                  | Ch164 (HDMI data only)        | Ch165 (HDMI data + wake-up)   |
|-------------------------|-------------------------------|-------------------------------|
| Fit ALMs                | 27,271 / 46,800 (58 %)        | 27,374 / 46,800 (58 %)        |
| Fit RAM / PLL blocks    | 14 / 1                        | 14 / 1 (unchanged)            |
| **Fit pins**            | **45 / 351**                  | **49 / 351** (+4 control)     |
| Setup slack worst       | +7.536 ns                     | +7.198 ns                     |
| `quartus_asm`           | Successful                    | Successful (`.sof` produced)  |

The +103 ALMs are the I²C controller's bit-bang state machine
and the 38-entry LUT walker. STA stays positive on every
class — the wake-up FSM lives entirely on the I²C-clock domain
(slow), and Recovery analysis on the `iRST_N` async deassert
shows a clean +17.621 ns of slack.

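The deltas quoted for this chapter follow directly from the table. A small sketch of that bookkeeping, with the figures copied from the Ch164 and Ch165 columns (the dictionary layout and helper name are illustrative only):

```python
DEVICE_ALMS = 46_800
DEVICE_PINS = 351

# Fit results copied from the Ch164 / Ch165 columns of the table above.
fit = {
    "Ch164": {"alms": 27_271, "pins": 45},
    "Ch165": {"alms": 27_374, "pins": 49},
}


def delta(metric: str, before: str = "Ch164", after: str = "Ch165") -> int:
    """Difference in one fit metric between two chapters."""
    return fit[after][metric] - fit[before][metric]


alm_delta = delta("alms")   # cost of the I2C controller plus LUT walker
pin_delta = delta("pins")   # SCL, SDA, TX_INT, MCLK
alm_utilisation_pct = round(100 * fit["Ch165"]["alms"] / DEVICE_ALMS)

print(alm_delta, pin_delta, alm_utilisation_pct)
```
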
**TB note** — `tb_de25_nano_psmct32_raster_demo_top` (the
Ch149 board smoke) wires up the new HDMI_TX_INT input
(tied high = no interrupt) and leaves the I²C SCL/SDA lines
floating; the wake-up FSM walks the LUT but full completion
takes ~125 ms simulated at the production divider
(controller-clock period ~100 µs × 33 phases/byte × 38 bytes),
far longer than the existing 5 ms TB runtime. The board TB
doesn't observe `hdmi_init_done` directly — it pre-dates the
wake-up FSM and only smoke-tests the wrapper. The Ch165 audit
landed `tb_hdmi_i2c_wake_smoke` (`sim/tb/top/`), which
overrides `CLK_Freq / I2C_Freq` to collapse the divider so the
walk runs in microseconds and asserts the LUT walk + READY
rise + HDMI_TX_INT retrigger + open-drain SDA + the Ch166
sticky NACK watchdog.

Ch167 added a bus-level byte-sequence
lock: the TB switched its SDA model from pulldown to
pullup + a phase-aware slave-ACK driver (drives strong-LOW
exactly when `u_dut.u0.phase` is `PH_ACK0/1/2`, releases
otherwise so the master's data bits are visible on the
wire). A decoder samples SDA on each SCL rising edge
between START and STOP, assembles the three bytes per
transaction into a 24-bit `{dev_addr, reg, data}` tuple,
and compares against `u_dut.mI2C_DATA[23:0]` snapshotted
on `mI2C_GO` rising edges. Asserts: 38 captured == 38
intent, every byte matches, every dev_addr is `8'h72`.
The Phase 3 open-drain check also flipped semantics from
"SDA never strong-HIGH" to "SDA never `'x`" (the right
violation test for the pullup bus).

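The bus-level decoder described in the TB note can be illustrated with a small software model. This is a sketch, not the testbench code: it assumes an idealised waveform of `(scl, sda)` samples, treats every ACK slot as driven low by the slave, and uses hypothetical names (`encode_write`, `decode`). The register and data bytes in the example are arbitrary and are not taken from the project's LUT; only the `0x72` device address comes from the text.

```python
def encode_write(dev_addr, reg, data):
    """Build (scl, sda) samples for one 3-byte I2C write with ACK bits low."""
    samples = [(1, 1), (1, 0), (0, 0)]            # idle, START, SCL low
    for byte in (dev_addr, reg, data):
        bits = [(byte >> i) & 1 for i in range(7, -1, -1)] + [0]  # MSB first + ACK
        for bit in bits:
            samples += [(0, bit), (1, bit), (0, bit)]  # setup, SCL rise, SCL fall
    samples += [(0, 0), (1, 0), (1, 1)]           # STOP: SDA rises while SCL high
    return samples


def decode(samples):
    """Sample SDA on SCL rising edges between START and STOP.

    Returns one (24-bit word, all_acked) pair per transaction.
    """
    results = []
    bits = None                                   # None means "not inside a transaction"
    prev_scl, prev_sda = 1, 1                     # bus idles high
    for scl, sda in samples:
        if prev_scl and scl and prev_sda and not sda:        # START condition
            bits = []
        elif prev_scl and scl and not prev_sda and sda:      # STOP condition
            # The SCL rise just before STOP adds one stray sample; ignore it.
            if bits is not None and len(bits) // 9 == 3:
                word, acked = 0, True
                for n in range(3):
                    group = bits[9 * n: 9 * n + 9]
                    byte = 0
                    for b in group[:8]:
                        byte = (byte << 1) | b
                    word = (word << 8) | byte
                    acked = acked and group[8] == 0
                results.append((word, acked))
            bits = None
        elif not prev_scl and scl and bits is not None:      # SCL rising edge
            bits.append(sda)
        prev_scl, prev_sda = scl, sda
    return results


intent = [(0x72, 0x01, 0xAB), (0x72, 0x02, 0xCD)]   # arbitrary example writes
wave = [s for txn in intent for s in encode_write(*txn)]
captured = decode(wave)
print([hex(word) for word, _ in captured])
```

Comparing `captured` against the intended tuples is the software analogue of the "38 captured == 38 intent, every byte matches" assertion, and checking the top byte of each word corresponds to the `8'h72` device-address check.
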
**Snapshots**: Ch164 baseline preserved at
[`baseline_ch164/`](../../synth/de25_nano/top_psmct32_raster_demo/baseline_ch164/);
Ch165 baseline at
[`baseline_ch165/`](../../synth/de25_nano/top_psmct32_raster_demo/baseline_ch165/).

**Open Ch168+ items**:
- Proper `set_output_delay` constraints on HDMI_TX_* once the
  ADV7513 setup/hold window is locked from the bring-up
  datasheet pass (replaces the Ch164 `set_false_path -to`).
- Make the rendered pattern bigger than Ch123's 16×8 SPRITE
  so there's something visible to admire on a real screen.
- xfer-side T4 coverage TB (still open from Ch157+).
- `useg_shadow_mem` BRAM-shape forensics.

**Sim regression**: 144 PASS / 0 FAIL.
`tb_de25_nano_psmct32_raster_demo_top` PASSES with the new
HDMI control ports wired up (HDMI_TX_INT held high in the
TB; LED=`0b11111000` shows the existing 3 status LEDs lit
— LED[3] stays unlit because the LUT walk doesn't complete
in 5 ms of sim). `tb_hdmi_i2c_wake_smoke` PASSES the
accelerated bring-up + Ch166 NACK-watchdog assertions.

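The claim that the walk cannot finish inside the board TB, and the reading of the LED value, can both be checked numerically. A minimal sketch using the three factors quoted in the TB note and the active-low LED convention; the constant and function names are illustrative:

```python
# Factors quoted in the TB note for the production divider.
CONTROLLER_CLK_PERIOD_S = 100e-6
PHASES_PER_BYTE = 33
BYTES_WALKED = 38
TB_RUNTIME_S = 5e-3

walk_time_s = CONTROLLER_CLK_PERIOD_S * PHASES_PER_BYTE * BYTES_WALKED
walk_completes_in_tb = walk_time_s <= TB_RUNTIME_S


def lit_leds(led_value: int, width: int = 8) -> list:
    """Indices of lit LEDs, assuming active-low drive (0 = lit)."""
    return [i for i in range(width) if not (led_value >> i) & 1]


print(f"{walk_time_s * 1e3:.1f} ms, completes in TB: {walk_completes_in_tb}")
print(lit_leds(0b11111000))
```

The estimate comes out around 25 times longer than the 5 ms runtime, consistent with `LED[3]` staying unlit in the board TB.
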
### Hardware-readiness pass for the Ch123 PSMCT32 raster demo (Ch144)

Ch144 is a synthesis/FPGA-readiness audit around the first
hardware-demo candidate (Ch123 PSMCT32 raster e2e, marked above).
No RTL changes — Ch144 documents what a top-level FPGA wrapper
needs to know before attempting a first build.

**RTL dependency tree (Ch123-only)** — what the demo *actually*
instantiates. The full `RTL_SRCS` list compiled by sim contains
~40 modules; Ch123 only reaches these 11, plus the swizzle math
primitive that the three swizzle-aware modules each instantiate
internally:

| Module                       | Role in Ch123                                               |
|------------------------------|-------------------------------------------------------------|
| `bios_rom_stub`              | EE bootlet at 0xBFC0_0000 (~18 instructions)                |
| `ee_ram_stub`                | DMAC-side GIF payload (~24 qwords)                          |
| `ee_memory_map_stub`         | EE-CPU + DMAC + bios + map's GS-priv decode                 |
| `ee_core_stub`               | MIPS R5900 core running the bootlet                         |
| `ee_gs_priv_bridge_stub`     | EE 32-bit MMIO → 64-bit GS-priv reg writes                  |
| `dmac_reg_stub`              | DMAC ch2 NORMAL transfer                                    |
| `gif_packed_stub`            | GIFtag + PACKED A+D parser                                  |
| `gs_stub`                    | GS register file + raster (`PSMCT32_SWIZZLE=1`)             |
| `gif_image_xfer_stub`        | TRXDIR/IMAGE engine (`PSMCT32_SWIZZLE=1`, dormant in Ch123) |
| `vram_stub`                  | 8 KiB VRAM (one PSMCT32 page)                               |
| `gs_pcrtc_stub`              | PCRTC scanout (`PSMCT32_SWIZZLE=1`)                         |
| `gs_swizzle_psmct32_stub`    | Pure-comb math, instantiated x3 inside the gates above      |

**Sim-only constructs audit** (full sweep of the 12 modules
above):
- `bios_rom_stub.sv` and `ee_ram_stub.sv` — `$display` /
  `$readmemh` inside `initial begin`. Both are synth-safe:
  Xilinx Vivado and Intel Quartus support `$readmemh` for BRAM
  initialization, and `$display` is silently ignored by all
  major synthesizers.
- `vram_stub.sv` L114-117 — single `$error` parameter validator
  inside `initial begin`. Synth ignores it; the BYTES parameter
  must be set to a sane value at instantiation regardless.
- `ee_gs_priv_bridge_stub.sv` L118 — runtime `$error` on
  unsupported byte enables, inside `always_ff`. Synth ignores
  the `$error`; the surrounding logic still synthesizes
  correctly.
- **No** `$finish` / `$dumpfile` / `$random` / `force` /
  `release` / `real`-typed signals / hierarchical refs in any
  module of the **Ch123 dep tree**. (TBs use hierarchical refs
  into `bios_rom_stub` to preload the bootlet — that's a
  TB-only concern; on hardware the bootlet image is the BRAM init.
  Out-of-tree note: `boot_install_agent_stub.sv` (SIF subsystem,
  not in the Ch123 dep tree) contains a `$fatal` runtime
  validator, but it is never compiled into the Ch123 hardware
  build.)

**Memory sizing**:

| Memory             | Default       | Ch123 sim setting | Ch123 hw recommendation | FPGA fit                         |
|--------------------|---------------|-------------------|-------------------------|----------------------------------|
| `bios_rom_stub`    | 4 MiB         | 4 KiB             | 4 KiB                   | ≤1 BRAM tile                     |
| `ee_ram_stub`      | 16 KiB        | 4 KiB             | 4 KiB                   | ≤1 BRAM tile                     |
| `vram_stub`        | 64 KiB        | 8 KiB             | 8 KiB                   | ≤2 BRAM tiles (one PSMCT32 page) |
| `ee_memory_map_stub.useg_shadow_mem` (Ch145) | 4 MiB | 4 MiB | **4 KiB** (override `USEG_SHADOW_WORDS_PARAM=1024`) | ≤1 BRAM tile when overridden |

The 16×8 framebuffer needs only 16×8×4 = 512 bytes; 8 KiB gives
the full first PSMCT32 page (FBP=0). For a more ambitious
hardware demo (multi-page framebuffers, textures), grow
`vram_stub.BYTES` toward 1 MiB / 4 MiB. Real PS2 has 4 MiB of
VRAM; a first hardware build can stay at 8 KiB.

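The framebuffer sizing above reduces to a short calculation. The sketch below assumes 4 bytes per PSMCT32 pixel (as used in the 16×8×4 figure) and the standard GS PSMCT32 page shape of 64×32 pixels, which is where the 8 KiB page size comes from; the helper names are illustrative.

```python
BYTES_PER_PSMCT32_PIXEL = 4
PSMCT32_PAGE_PIXELS = (64, 32)          # width, height of one PSMCT32 page
PAGE_BYTES = PSMCT32_PAGE_PIXELS[0] * PSMCT32_PAGE_PIXELS[1] * BYTES_PER_PSMCT32_PIXEL


def framebuffer_bytes(width: int, height: int) -> int:
    """Bytes needed for a linear PSMCT32 framebuffer of the given size."""
    return width * height * BYTES_PER_PSMCT32_PIXEL


def pages_needed(width: int, height: int) -> int:
    """Whole pages needed, by byte count only (ignores page-shape alignment)."""
    return -(-framebuffer_bytes(width, height) // PAGE_BYTES)  # ceiling division


demo_fb = framebuffer_bytes(16, 8)
print(demo_fb, PAGE_BYTES, pages_needed(16, 8))
```

The same helper gives a quick feel for growth: a 640×448 PSMCT32 framebuffer already needs more than 1 MiB, which is why larger demos push `vram_stub.BYTES` toward the 1 MiB / 4 MiB range.
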
**Ch145 — `useg_shadow_mem` parameterization**: pre-Ch145, the
ee_memory_map_stub's useg-shadow backing was a fixed 1M-word /
4 MiB array. That was correct for the BIOS-smoke chapters that
need full first-4-MiB-of-useg coverage, but it's wasted area
for the Ch123 hardware demo (which never touches useg — the
bootlet runs from BIOS at 0xBFC0_0000 and the GIF payload from
RAM at phys 0x100). Ch145 promotes `USEG_SHADOW_WORDS` from a
hardcoded `localparam` to the `USEG_SHADOW_WORDS_PARAM` module
parameter (default 1M words = 4 MiB → existing TBs unchanged).
For the Ch123 hardware demo, the top-level wrapper instantiates
`ee_memory_map_stub` with `.USEG_SHADOW_WORDS_PARAM(1024)` to
shrink the inferred BRAM footprint by ~1024×; correctness is
unaffected because no useg access ever happens in the Ch123
data plane.

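The shrink factor follows from the two word counts. A short sketch of the arithmetic, assuming 32-bit (4-byte) words as implied by "1M words = 4 MiB"; the names are illustrative:

```python
WORD_BYTES = 4                 # implied by 1M words = 4 MiB
DEFAULT_WORDS = 1 << 20        # pre-Ch145 fixed size (1M words)
DEMO_WORDS = 1024              # .USEG_SHADOW_WORDS_PARAM(1024)

default_bytes = DEFAULT_WORDS * WORD_BYTES
demo_bytes = DEMO_WORDS * WORD_BYTES
shrink_factor = DEFAULT_WORDS // DEMO_WORDS

print(default_bytes // (1024 * 1024), "MiB ->", demo_bytes // 1024, "KiB,",
      f"{shrink_factor}x smaller")
```
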
**Clock / reset assumptions**:
- Single clock domain (`clk`) — all 12 modules share one input.
- Active-low synchronous reset input (`rst_n`) — also a single
  shared input. No reset gating, no per-module variants. The
  reset is sampled inside `always_ff @(posedge clk)` via the
  `if (!rst_n)` pattern (NOT `posedge clk or negedge rst_n`) —
  i.e., it is NOT an async reset despite being active-low. On
  FPGA this should be brought up via the device's reset bridge
  so the deasserting edge is synchronous to `clk`.
- No clock gating, no derived clocks. The PCRTC's hsync/vsync/de
  are regular clock-domain outputs, not separate clocks.

**Swizzle gate parameter defaults**:
- All four swizzle parameters (`PSMCT32_SWIZZLE`,
  `PSMCT16_SWIZZLE`, `PSMT8_SWIZZLE`, `PSMT4_SWIZZLE`) default
  to `1'b0` on `gs_stub`, `gs_pcrtc_stub`, and
  `gif_image_xfer_stub`. For the Ch123 hardware demo,
  instantiate these three modules with **`.PSMCT32_SWIZZLE(1'b1)`**
  and the other three left at `1'b0`. The swizzle-math
  primitives (`gs_swizzle_psmct32_stub` etc.) are pure-comb and
  trim cleanly when their gate is off.

**Top-level harness expectations** (for a future
`top_psmct32_raster_demo.sv`):
- Ports: `clk` and `rst_n` inputs, plus board-level video-out
  connections (HDMI / DVI / VGA — driven by `r/g/b/hsync/vsync/de`
  from `gs_pcrtc_stub`).
- The EE bootlet image must be preloaded into `bios_rom_stub`
  via either `IMAGE_FILE` (→ `$readmemh`) or a bake-step that
  writes a `.mem` next to the synthesis project. The bootlet is
  18 MIPS instructions (currently authored procedurally in the
  Ch123 TB body via `ee_prog_word()`); for hardware this needs
  to become a static `.mem` checked into the repo.
- The GIF payload must be preloaded into `ee_ram_stub` via the
  same mechanism — 24 qwords starting at `PAYLOAD_MADR=0x100`.
  The current TB authors them procedurally with `preload_qword()`;
  hardware needs a static `.mem`.
- The `core_go` signal must be tied high (or pulsed by a board
  reset-release sequencer) so the EE starts fetching from
  `0xBFC0_0000`.

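A bake step for the static `.mem` files could be as small as the sketch below, which emits text in the plain `$readmemh` format (one hex word per line, with an optional `@` address record expressed in memory-word units). Everything here is an assumption for illustration: `words_to_mem` is a hypothetical helper, the bootlet words are MIPS NOP placeholders rather than the real 18-instruction program, and the payload start index assumes `ee_ram_stub` is addressed in 128-bit qwords so that byte address `0x100` maps to index `0x10`.

```python
def words_to_mem(words, width_bits=32, start_index=None):
    """Render words as $readmemh text: optional @index, then one word per line."""
    digits = width_bits // 4
    lines = []
    if start_index is not None:
        lines.append(f"@{start_index:X}")       # address is in words, not bytes
    for word in words:
        if not 0 <= word < (1 << width_bits):
            raise ValueError(f"word does not fit in {width_bits} bits: {word:#x}")
        lines.append(f"{word:0{digits}X}")
    return "\n".join(lines) + "\n"


# Placeholder bootlet: 18 MIPS NOPs (0x00000000), NOT the real Ch123 bootlet.
bootlet_mem = words_to_mem([0x00000000] * 18, width_bits=32)

# Placeholder payload: 24 zero qwords; index 0x10 assumes qword addressing
# of ee_ram_stub (byte address 0x100 / 16 bytes per qword).
payload_mem = words_to_mem([0] * 24, width_bits=128, start_index=0x100 // 16)

print(bootlet_mem.splitlines()[0])
print(payload_mem.splitlines()[:2])
```

Writing the two strings to files next to the synthesis project and pointing `IMAGE_FILE` at them would keep the simulation and hardware images identical, provided the memory word width matches the array declared in each stub.
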
**Known sim-only constructs that should NOT block first build**:
- `$display` lines in BIOS/RAM init (synth ignores).
- `$readmemh` (synth tools handle it for BRAM init).
- `$error` parameter validators (synth ignores).

**Known sim-only constructs that WOULD block first build**:
- None found in the Ch123 dep tree.

**Open questions for the hardware-build session** (deliberately
not answered here — they need a board-level decision):
- Target FPGA family + clock frequency (PCRTC was designed
  around 13.5 MHz pixel clock for the 16×8 active area; first
  build can run at any clock since the TB doesn't model real
  CRTC timing).
- Video-out PHY (HDMI core, VGA DAC, on-board HDMI
  transmitter chip).
- BIOS / payload bake step (Vivado `update_compile_order` +
  `.mem` files vs. a SystemVerilog `localparam` array
  preload).
- Whether to keep `ee_core_stub`'s `STRICT_UNSUPPORTED` gate
  active on hardware (catches unknown opcodes by halt+latch —
  useful for debugging, but a hard failure on any unintended
  fetch).

The Ch90 white-box TB `tb_gs_scanout_basic.sv` exercises the
full round trip: instantiates `gs_stub` + `vram_stub` +
`gs_pcrtc_stub`, drives a 4×4 sprite through the GIF reg port,
waits for raster to fully drain, then enables scanout and
captures one full frame's `(hcnt, vcnt) → (r, g, b)` trace.
Asserts: every pixel inside the sprite reads as the emitted
color, every pixel outside reads as 0, and at least one EV_MODE
frame trace fires.

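The inside/outside assertion can be pictured with a small software model of the captured trace. This is a sketch only: the frame size, sprite position, and colour are made-up example values, and `check_frame` is a hypothetical helper rather than code from `tb_gs_scanout_basic.sv`.

```python
def check_frame(trace, sprite_rect, color):
    """Return mismatches between a (hcnt, vcnt) -> (r, g, b) trace and the
    expected image: `color` inside the sprite rectangle, black elsewhere."""
    x0, y0, width, height = sprite_rect
    mismatches = []
    for (x, y), rgb in sorted(trace.items()):
        inside = x0 <= x < x0 + width and y0 <= y < y0 + height
        expected = color if inside else (0, 0, 0)
        if rgb != expected:
            mismatches.append(((x, y), rgb, expected))
    return mismatches


# Example values (assumed): 8x8 frame, 4x4 sprite at (2, 2), arbitrary colour.
SPRITE = (2, 2, 4, 4)
COLOR = (0x80, 0x40, 0x20)
good_trace = {
    (x, y): COLOR if 2 <= x < 6 and 2 <= y < 6 else (0, 0, 0)
    for y in range(8) for x in range(8)
}

bad_trace = dict(good_trace)
bad_trace[(0, 0)] = COLOR          # a stray pixel outside the sprite

print(len(check_frame(good_trace, SPRITE, COLOR)))
print(check_frame(bad_trace, SPRITE, COLOR))
```
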