# 0010 — On-chip tile-local renderer core (first tiled-VRAM rung)

**Status:** proven in sim (Ch303), board-pending. One 16×16 tile, on-chip color+Z,
flush to VRAM. Texture still BRAM-VRAM. NO LPDDR, NO multi-tile binning yet.

## Why

Doc 0009 established the per-pixel requirement for a combined textured+alpha+depth
primitive: **hidden = 1 read / 0 writes; visible = 3 reads / 2 writes.** Doc 0008 §6
sets the target architecture: on-chip tile color+Z (RMW resolved on-chip), LPDDR as
backing VRAM, texture streamed/cached. This rung builds the **first piece**: the
on-chip tile color+Z scratchpad with flush, so the combined RMW happens on-chip and
only the texture fetch + the tile flush cross to VRAM. (Codex framing: build the
tile-local core first; stage LPDDR integration later.)

## What was built

- **`gs_tile_ram`** (rtl/gif_gs/gs_tile_ram.sv): generic 1W1R on-chip tile RAM,
  registered read (1-cycle, matching the VRAM read2 contract). Instantiated twice
  in gs_stub (gated by `TILE_LOCAL`): `u_tile_color` (256×32) + `u_tile_z` (256×32)
  — one 16×16 tile, ~2 KiB total.
- **gs_stub `TILE_LOCAL` mode** (default 0 → byte-identical): a combined TME+ABE+ZTE
  triangle renders into the tile via a CLEAR → RENDER → FLUSH sequence overlaid on
  the existing R_IDLE/R_SCAN/R_DRAIN FSM.

## The tile memory schedule (the deliverable)

```
CLEAR   : 256 cycles → write tile_color = TILE_CLEAR_COLOR, tile_z = TILE_CLEAR_Z
          (every entry initialized; the "background")

RENDER  (per inside pixel, tile index = {y[3:0], x[3:0]}):
  beat0   read  tile_z
  beat1   Z-test (GEQUAL: frag_z vs tile_z).  FAIL → STOP (no texture read,
          no tile_color read, no tile_color/tile_z write).  PASS → read texture (VRAM)
  beat2   texel ready (Cs/As) ; read tile_color (dest)
  beat3   blend ; WRITE tile_color
  beat4   WRITE tile_z (skip on ZMSK)

FLUSH   : 256 cycles → read tile_color[idx] (registered) → framebuffer write
          (raster_pixel_emit → VRAM at the linear FB address). ~70 MB/s class
          per doc 0008 — trivial.
```

In tile terms:
- **hidden pixel:** tile_z read only. No texture, no tile_color read/write, no tile_z write.
- **visible pixel:** tile_z read + texture read + tile_color read + tile_color write + tile_z write.
- **flush:** tile_color read → framebuffer write, ×256.

Texture stays on the VRAM read2 path (unchanged). Only color/Z moved on-chip.
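
The schedule can be sanity-checked with a small software model. The following is a minimal sketch, not the RTL: `render_pixel`, `reads_writes`, the placeholder `replace` blend and the clear values are illustrative names and numbers chosen here. It counts memory accesses per inside pixel and reproduces the doc 0009 budget (hidden = 1 read / 0 writes, visible = 3 reads / 2 writes).

```python
from collections import Counter

def render_pixel(tile_z, tile_color, idx, frag_z, texel, blend, zmsk=False):
    """Model one inside pixel of RENDER; returns a Counter of memory accesses."""
    acc = Counter()
    acc["tile_z_rd"] += 1                 # beat0: read tile_z
    if frag_z < tile_z[idx]:              # beat1: GEQUAL fails, stop here
        return acc
    acc["tex_rd"] += 1                    # beat1: texture read (VRAM)
    acc["tile_color_rd"] += 1             # beat2: read dest color
    tile_color[idx] = blend(texel, tile_color[idx])   # beat3: blend + write
    acc["tile_color_wr"] += 1
    if not zmsk:                          # beat4: Z write, skipped on ZMSK
        tile_z[idx] = frag_z
        acc["tile_z_wr"] += 1
    return acc

def reads_writes(acc):
    reads = sum(n for k, n in acc.items() if k.endswith("_rd"))
    writes = sum(n for k, n in acc.items() if k.endswith("_wr"))
    return reads, writes

# Illustrative clear values (assumptions, not the TB's parameters).
tile_z = [0x5000] * 256
tile_color = [0xFF008000] * 256
replace = lambda src, dst: src            # placeholder blend for the sketch

hidden = render_pixel(tile_z, tile_color, 0, 0x4000, 0xFFFFFFFF, replace)
visible = render_pixel(tile_z, tile_color, 1, 0x6000, 0xFFFFFFFF, replace)
```

Only `tex_rd` in this model crosses to VRAM; every other access stays in the tile RAMs.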

## Verification (tb_top_psmct32_tile_demo)

Combined triangle (interpolated Z crossing the clear Z) over a CLEAR'd green tile.
A tracer on the tile-RAM ports + the emit port asserts the schedule:
- CLEAR wrote 256 color + 256 Z entries.
- hidden (depth-fail) pixels: no texture read, no tile_color write, no tile_z write.
- visible (depth-pass) pixels: texture read + tile_color write + tile_z write; rendered
  color = blend(texel, clear-green); occluded/outside = clear green.
- FLUSH emitted 256 framebuffer writes; final scanout matches the Ch302 image.

Result: clear 256/256, flush 256, 35 visible / 7 hidden, errors=0. `TILE_LOCAL=0`
keeps every prior demo byte-identical.

## External LPDDR bandwidth model (documented, not yet exercised)

Per doc 0008: framebuffer flush is **trivial** (640×480×4 B ≈ 1.2 MB/frame ≈ 70 MB/s
@60fps, noise vs the measured 8.4 GB/s). Texture fetch + tile-reload from primitive
disorder are the real DDR consumers and are **locality-driven** — to be measured
against real GS traces (doc 0008 §4–5), not synthesized. This rung does NOT touch
LPDDR: texture is BRAM-VRAM, one tile, no eviction.
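
The flush arithmetic is easy to reproduce. This sketch only restates the numbers quoted above (640×480, 4 B/pixel, 60 fps, 8.4 GB/s measured); the function and constant names are illustrative.

```python
def flush_bytes_per_second(width, height, bytes_per_pixel, fps):
    """Framebuffer flush traffic if every pixel is written once per frame."""
    return width * height * bytes_per_pixel * fps

FRAME_BYTES = 640 * 480 * 4                         # bytes per PSMCT32 frame
FLUSH_BPS = flush_bytes_per_second(640, 480, 4, 60)
LPDDR_BPS = 8.4e9                                   # measured figure from doc 0008
FLUSH_SHARE = FLUSH_BPS / LPDDR_BPS                 # fraction of measured bandwidth
```

Under these assumptions the flush uses under 1% of the measured bandwidth, which is why texture fetch and tile reload are the terms worth measuring.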

## Ch304 — 2×2 multi-tile grid (extension)

The single-tile core generalizes to a `TILE_COLS×TILE_ROWS` grid with minimal
change, because (a) `tile_idx = {y[3:0],x[3:0]}` is already the tile-local address
for any 16-aligned tile, and (b) attribute interpolation is screen-space → seams
are continuous by construction. Added: an outer tile loop (the popped primitive +
solved gradients persist across all tiles), per-tile walker-bbox clip to
`primitive_bbox ∩ tile` (skip render if no overlap → tile shows clear color), and
a flush FB-address offset by the tile origin. `TILE_COLS=TILE_ROWS=1` is byte-
identical to the single-tile path. Codex scope: fixed primitive list (one
primitive re-rendered per tile), re-test-against-each-tile, NO external bin memory.

Proven (tb_top_psmct32_tile2x2_demo): one triangle spanning a 2×2 grid (32×32,
crossing x=16 & y=16) — all 4 tiles clear independently (256 each), 1024 flush
emits, and the **whole 32×32 scanout matches a single screen-space reference**
(718/718, 67 seam-region pixels continuous) → no visible seams. This is the
re-test-each-primitive-against-each-tile architecture; a real binning engine /
command buffer is a later optimization, not needed for the architectural proof.
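
A short sketch of why the grid needs so little new logic. `tile_idx` is the `{y[3:0], x[3:0]}` address from the text; `grid_flush_pixel_index` is an illustrative model of "flush FB address offset by the tile origin" that reuses the `y*(fbw<<6)+x` pixel-index form quoted in Ch316. The function names and the `fbw=1` example are assumptions.

```python
TILE = 16

def tile_idx(x, y):
    """Tile-local address {y[3:0], x[3:0]}; identical for any 16-aligned tile."""
    return ((y & 0xF) << 4) | (x & 0xF)

def grid_flush_pixel_index(tile_col, tile_row, idx, fbw):
    """Linear FB pixel index for tile-local entry idx, offset by the tile origin."""
    x = tile_col * TILE + (idx & 0xF)
    y = tile_row * TILE + (idx >> 4)
    return y * (fbw << 6) + x

# Every pixel of a 2x2 grid should land on a distinct FB pixel index.
ALL_2X2 = {grid_flush_pixel_index(c, r, i, 1)
           for r in range(2) for c in range(2) for i in range(256)}
```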

## Ch305 — MULTI-PRIMITIVE tiled scene (extension)

Generalizes the grid from re-rendering ONE primitive per tile to compositing a
LIST of primitives per tile, in order, so later primitives depth-test/alpha-blend
over earlier ones within each tile. Gated by `TILE_MULTIPRIM` (default 0 →
byte-identical) + `TILE_PRIM_COUNT` (batch size). The primitive FIFO IS the list
store (its slots already hold each primitive's pre-solved gradients), so re-reading
a slot is free. Per tile: CLEAR → load+render prim 0 → (pipeline-flush) → load+render
prim 1 → … → FLUSH. The grid starts only once the whole batch is buffered
(`fifo_count >= TILE_PRIM_COUNT && all_grad_done`) — the demo-honest stand-in for a
future GIF-EOP/kick. Empty-clip primitives are skipped per tile; the whole FIFO is
drained at grid end (streaming/partial-drain is future work). The inter-primitive
advance waits for the per-pixel pipeline to fully flush (`comb_pipe_empty`), not just
the walker reaching R_DRAIN, so a primitive's in-flight color/Z writes commit before
the next primitive's `ras_*` load.

Proven (tb_top_psmct32_tile_multiprim_demo): 3 combined prims over the 2×2 grid —
opaque blue bg (Z=0x5000), opaque red (Z=0x6000, in front), translucent white
(Z=0x5800: blends over blue, but is OCCLUDED wherever red covers the pixel, because
0x5800 < 0x6000 fails the GEQUAL test). The whole 32×32
scanout matches a software integer-Z-buffer + source-over replay (514/514), with all
interaction regions exercised (blue 24 / red 48 / light-blue 26 / occlusion 19 /
green 416) and seam continuity (50 seam matches). This is the architecture a real
command-stream replay needs; a per-tile bin buffer is a later optimization.

DEBUG NOTES (two non-obvious bugs surfaced): (1) the per-primitive clip wires indexed
a FIFO array through a function inside continuous `assign`s — iverilog-12 mis-reads
that as 0 (silent hang, sim-time frozen); fixed by computing them in `always_comb`
(legal SV, Quartus-clean; sim-only workaround). (2) The first failing image was a
FIXTURE bug, not RTL: three solid 4×4 textures placed 0x100 apart aliased, because a
PSMCT32 texture with TBW=1 has a 0x100-byte row stride so a 4-tall texture spans
0x400 bytes; spacing them 0x400 apart (TBP0 32/36/40) fixed it. The depth/RMW path
was correct all along.
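
The fixture aliasing in note (2) can be reproduced with address arithmetic. This sketch assumes a linear (non-swizzled) layout, TBP0 in 256-byte units and a row stride of `TBW*64` texels, which is consistent with the 0x100 stride and the TBP0 32/36/40 spacing quoted above; `texel_addrs` and `aliases` are illustrative helpers.

```python
def texel_addrs(tbp0, tbw, width, height, bytes_per_texel=4):
    """Byte addresses of every texel of a linear PSMCT32 texture."""
    base = tbp0 * 256                       # TBP0 counts 256-byte units
    stride = tbw * 64 * bytes_per_texel     # TBW=1 -> 0x100 bytes per row
    return {base + y * stride + x * bytes_per_texel
            for y in range(height) for x in range(width)}

def aliases(a, b):
    """True if two textures share at least one texel address."""
    return bool(a & b)

tex_a = texel_addrs(32, 1, 4, 4)
tex_bad = texel_addrs(33, 1, 4, 4)    # 0x100 apart: row 0 lands on tex_a's row 1
tex_good = texel_addrs(36, 1, 4, 4)   # 0x400 apart: clear of all four rows
```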

## Ch306 — GS SCISSOR clipping (extension)

Bakes the GS SCISSOR_1 rectangle into the tile-traversal walker bounds (param
`SCISSOR_ENABLE`, default 0 → byte-identical). Because the scissor is a rectangle and
the walker scans a rectangle, the effective draw region = primitive bbox ∩ tile bbox ∩
scissor rect is itself rectangular — so the scissor is just intersected into the
walker bbox at all clip sites (single-prim `clip_*`, multiprim `always_comb`, and the
`mp_next_nonempty` empty-test). NO per-pixel scissor test: pixels outside the scissor
are never visited, so color and Z writes are both suppressed for free. SCISSOR_1 (GIF
reg 0x40) is parsed into a GLOBAL `scissor_1_q` (reset full-range); decoded fields →
12-bit `eff_sc*` gated by SCISSOR_ENABLE (0/0xFFF when off → max/min no-op). Per-
primitive (FIFO-snapshot) scissor is a future extension if a command stream varies it.

Proven (tb_top_psmct32_tile_scissor_demo): the Ch305 3-prim scene + SCISSOR_1
[9..22]×[6..20] (crossing both seams) — 514/514 match, clipped=39 (would-be-scene
pixels outside the rect are clear green), inside=59 (kept scene matches the unclipped
ref), exact boundary (edgePairs=6), seam=50. Regression 209→210, byte-identical.
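
Because all three regions are rectangles, the clip reduces to max/min on the bounds. A minimal sketch with inclusive `(x0, y0, x1, y1)` rectangles; the primitive bboxes are hypothetical, the scissor is the `[9..22]×[6..20]` rectangle from the demo, and the default scissor models the 0/0xFFF no-op used when `SCISSOR_ENABLE` is off.

```python
def intersect(a, b):
    """Intersection of two inclusive rectangles, or None if empty."""
    x0, y0 = max(a[0], b[0]), max(a[1], b[1])
    x1, y1 = min(a[2], b[2]), min(a[3], b[3])
    return (x0, y0, x1, y1) if x0 <= x1 and y0 <= y1 else None

def walker_bbox(prim, tile, scissor=(0, 0, 0xFFF, 0xFFF)):
    """primitive bbox ∩ tile bbox ∩ scissor; None means skip render for this tile."""
    r = intersect(prim, tile)
    return intersect(r, scissor) if r else None

SCISSOR = (9, 6, 22, 20)
TILE0 = (0, 0, 15, 15)
TILE3 = (16, 16, 31, 31)
PRIM = (4, 2, 28, 27)          # hypothetical primitive bbox
```

A `None` result corresponds to the empty-clip case: the tile is never walked, so neither color nor Z is written.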

## Ch307 — texture WRAP modes (REPEAT + CLAMP) (extension)

Adds GS texture wrap (CLAMP_1 WMS/WMT: REPEAT/CLAMP) for u/v, inside `gs_texture_unit`
(param `TEX_WRAP_ENABLE`, default 0 → pass-through byte-identical). Applied to u/v
BEFORE texel-address gen, so it covers the linear and swizzle paths and all callers at
one point. REPEAT = `u & ((1 << TW) - 1)`; CLAMP = `min(u, (1 << TW) - 1)` (2^TW is the texture width in texels; v uses TH the same way). gs_stub parses CLAMP_1
(reg 0x08) and snapshots wrap mode + TW/TH per primitive (FIFO, like ras_tbw), so
REPEAT and CLAMP primitives coexist in one scene. Codex sequencing: wrap/clamp before
bilinear, because it determines which edge neighbours a future bilinear filter samples.

Proven: a standalone sampler TB (tb_gs_texture_wrap) covers PSMCT32 + PSMT8 +
PSMT4-swizzle (wrap happens before swizzle); the board TB (tb_top_psmct32_tile_wrap_demo,
557/557) renders two textured tris sampling a striped 4×4 texture with UV 0..8 — REPEAT
tiles 2× (two white stripes), CLAMP sticks (one white stripe + edge-stretched blue).
Regression 210→212, byte-identical. (Fixture lesson: the first NON-solid tile texture
exposed an upload-giftag REGS nibble-count bug that solid textures had masked.)
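
The two wrap rules in integer form, as a sketch. `wrap_coord` and the mode constants are illustrative names; `log2_size=2` models the 4-texel-wide demo texture sampled with coordinates 0..7.

```python
REPEAT, CLAMP = 0, 1

def wrap_coord(u, log2_size, mode):
    """Apply REPEAT or CLAMP to an integer texel coordinate before address gen."""
    mask = (1 << log2_size) - 1          # last valid texel index
    return u & mask if mode == REPEAT else min(u, mask)

repeat_row = [wrap_coord(u, 2, REPEAT) for u in range(8)]
clamp_row = [wrap_coord(u, 2, CLAMP) for u in range(8)]
```

REPEAT visits the four texels twice (the texture tiles 2×), while CLAMP holds the last texel, matching the edge-stretched stripe in the board image.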

## Ch308 — PSMCT16 tile color buffer (extension)

The on-chip tile COLOR RAM can be PSMCT16 (RGB5A1, 16-bit) instead of PSMCT32, via
param `TILE_COLOR_PSMCT16` (default 0 → byte-identical). It HALVES the color tile RAM
(`TILE_COLOR_W` = 16; Z RAM stays 32-bit) — the first answer to "can tile color be
narrower than the 32-bit blend width when the frame format allows it?" (yes). The RMW
packs ABGR8888→pix16 on write/clear, unpacks pix16→ABGR (bit-replicate) for the blend
dest, and the FLUSH emits PSMCT16 framebuffer writes (mirroring the proven S2 PSMCT16
emit; vram_normalize keys the halfword off byte_addr[1]). Scanout reads PSMCT16 via
DISPFB1.PSM=0x02.

CONSTRAINT discovered: the combined tile path's primitive eligibility requires
FRAME.PSM==PSMCT32 (the combined RMW was built PSMCT32-only). So the PSMCT16-ness lives
in the tile RAM + flush + DISPFB, NOT in FRAME.PSM — the demo keeps FRAME PSMCT32 (so
the prims classify as combined) while DISPFB + tile + flush are PSMCT16. A fully-PSMCT16
FRAME would need the combined gate relaxed to accept PSMCT16 dest (future work).

Proven (tb_top_psmct32_tile_psmct16_demo, 514/514): the Ch305 scene in PSMCT16, matched
against a software reference that applies the SAME per-step 5-bit quantization
(q(c)=(c&0xF8)|(c>>5)) the on-chip RMW does — each primitive blends over the quantized
dest. Clear green 0x80→0x84, light-blue 0x7F→0x7B, pure blue/red unchanged, red
occlusion intact. Regression 212→213, byte-identical.
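
The quantization function is exactly a 5-bit pack followed by a bit-replicating unpack. A per-channel sketch (alpha handling is left out on purpose; `pack5`, `unpack5` and `q5` are illustrative names):

```python
def pack5(c8):
    """8-bit channel -> 5-bit channel (drop the low 3 bits)."""
    return c8 >> 3

def unpack5(c5):
    """5-bit channel -> 8-bit channel by bit replication."""
    return (c5 << 3) | (c5 >> 2)

def q5(c8):
    """Closed form used by the reference: q(c) = (c & 0xF8) | (c >> 5)."""
    return (c8 & 0xF8) | (c8 >> 5)
```

Since each primitive blends over an already-quantized dest, the reference has to apply `q5` after every blend step, not once at the end.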

## Ch309 — generic GS ALPHA blend modes (extension)

Generalizes the combined blender from the single hardcoded source-over to the GS
selector machinery `Cv = clamp((((A-B)*C)>>7) + D)` (A/B/C/D from ALPHA_1, FIX=[39:32]),
param `ALPHA_MODES_ENABLE` (default 0 → source-over, byte-identical). gs_alpha_blend
gains a_sel/b_sel/c_sel/d_sel/ad/fix inputs + a generic datapath; gs_stub FIFO-snapshots
the per-primitive selectors+FIX and wires them to u_comb_blend. The combined-eligibility
gate `close_combined` (which hardcoded source-over) is relaxed to accept any ABE
primitive when ALPHA_MODES_ENABLE — the generic blender handles any config. (Same class
of "eligibility gate too strict" as Ch308's PSMCT32-FRAME requirement: when you add a
per-pixel mode to the combined path, check the datapath AND close_combined.)

Proven (tb_top_psmct32_tile_alpha_demo, 514/514): the Ch305 scene with P1 ADDITIVE
(A=Cs,B=0,C=FIX=0x80,D=Cd → Cs+Cd) → magenta over the blue bg (glow/particle add), while
P0/P2 stay source-over (light-blue intact) and P2 is still depth-occluded by P1. Two
blend modes coexist. Regression 213→214, byte-identical.
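
A per-channel sketch of the generic equation with the two configurations used in the demo. The operands are passed already resolved rather than as selector codes, and the final clamp to 0..255 is an assumption of this sketch; `blend_channel` and `blend_rgb` are illustrative names.

```python
def clamp8(v):
    return max(0, min(255, v))

def blend_channel(a, b, c, d):
    """Cv = clamp((((A - B) * C) >> 7) + D) for one 8-bit channel."""
    return clamp8((((a - b) * c) >> 7) + d)

def blend_rgb(cs, cd, alpha_s, mode, fix=0x80):
    if mode == "source_over":            # A=Cs, B=Cd, C=As, D=Cd
        return tuple(blend_channel(s, d, alpha_s, d) for s, d in zip(cs, cd))
    if mode == "additive":               # A=Cs, B=0, C=FIX, D=Cd
        return tuple(blend_channel(s, 0, fix, d) for s, d in zip(cs, cd))
    raise ValueError(mode)

RED, BLUE, WHITE = (255, 0, 0), (0, 0, 255), (255, 255, 255)
```

With `FIX=0x80` the multiply-and-shift is the identity, so the additive configuration reduces to `Cs + Cd`: red added over blue gives magenta.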

## Ch310 — bilinear texture filtering (extension, 2-phase)

4-tap bilinear (PSMCT32), staged per Codex. PHASE 1: a multi-beat bilinear sampler in
gs_texture_unit (param BILINEAR_ENABLE, default 0): reads the 4 neighbours (each via the
Ch307 wrap), lerps by fractional U/V; schedule = 4·(1+RD_LATENCY)+1 ≈ 9 cyc/sample (the
architectural number for the future texture cache). Proven by a standalone TB
(tb_gs_texture_bilinear) — all 6 cases exact (center=nearest, halfway=4-tap avg, clamp
edge no-OOB, repeat edge wraps, nearest unchanged).

PHASE 2: integrated into the COMBINED
tile path — TEX1.MMAG (GIF 0x14 bit5) per-primitive selects nearest vs linear; a runtime
`filter_lin` input gates the 4-tap; the affine interp gains a frac sibling
(interp_affine_uv_frac → step[15:12]); a new CB_TWAIT beat stalls the per-pixel FSM on the
LEVEL !tex_busy until the ~9-cycle sample completes (the FSM steps half-rate on z_advance,
so a level wait can't miss the 1-cycle out_valid), then CB_T latches the HELD filtered texel.
Depth/Z/blend/tile-RMW unchanged; bilinear did NOT touch close_combined (the prim is still
source-over ABE).

Proven (tb_top_psmct32_tile_bilinear_demo): a magnified 4×4 blue/white
checker, nearest tri blocky (0 midtones) vs bilinear tri smoothed (all midtones), same
coverage (stall dropped nothing). Regression 215→216, byte-identical.
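
A per-channel sketch of the 4-tap filter and its cycle budget. The 4-bit fraction follows the `step[15:12]` frac sibling; the two-stage truncating lerp is an assumption of this sketch and the RTL's exact rounding may differ. Names are illustrative.

```python
def lerp(a, b, f4):
    """Integer lerp with a 4-bit fraction (0..15), truncating."""
    return (a * (16 - f4) + b * f4) >> 4

def bilinear(t00, t10, t01, t11, fu, fv):
    """Blend horizontally on both rows, then vertically."""
    top = lerp(t00, t10, fu)
    bottom = lerp(t01, t11, fu)
    return lerp(top, bottom, fv)

def sample_cycles(rd_latency=1, taps=4):
    """Schedule from the text: 4 * (1 + RD_LATENCY) + 1."""
    return taps * (1 + rd_latency) + 1
```

With zero fractions the result is the first tap (center = nearest), and with `RD_LATENCY=1` the schedule gives the 9 cycles per sample quoted above.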

## Ch311 — per-tile BIN BUFFER (extension)

Replaces Ch305's render-time re-test (mp_next_nonempty: each tile re-scans all prims) with
a real precomputed bin buffer (param BIN_BUFFER_ENABLE, default 0). A new TP_BIN phase runs
a (prim,tile) double-loop counter FSM (prim_count×NTILES cycles) that tests each prim's
bbox∩tile∩scissor (the same overlap math) and appends the prim index to bin_prim[tile][] /
bin_n[tile], in ascending draw order. The render then walks each tile's bin (CLEAR-done loads
bin slot 0; RENDER-drain steps through bin_n; FLUSH at end) — no re-scan. Equivalent image
to the re-test path (same overlap test + order). This is the primitive-ROUTING machinery for
command-stream replay; the grid stays 2×2 (prove the mechanism, scale later).

Proven
(tb_top_psmct32_tile_bin_demo): bins read back exactly (t0={0,1} t1={0,1} t2={0} t3={0,2} for
an all-tiles/2-tiles/1-tile prim trio) + image 594/594 vs the re-test reference. Regression
216→217, byte-identical.
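
The TP_BIN double loop can be modeled in a few lines. This is a sketch under stated assumptions: the three bboxes are hypothetical (chosen to give an all-tiles / 2-tiles / 1-tile trio), tiles are numbered row-major, and the scissor term is omitted.

```python
TILE, COLS, ROWS = 16, 2, 2

def overlaps(bbox, tile):
    """Inclusive rectangle overlap test on (x0, y0, x1, y1)."""
    return (bbox[0] <= tile[2] and tile[0] <= bbox[2] and
            bbox[1] <= tile[3] and tile[1] <= bbox[3])

def build_bins(prims):
    """(prim, tile) double loop; appends prim indices in ascending draw order."""
    bins = [[] for _ in range(COLS * ROWS)]
    cycles = 0
    for p, bbox in enumerate(prims):
        for t in range(COLS * ROWS):
            col, row = t % COLS, t // COLS
            tile = (col * TILE, row * TILE,
                    col * TILE + TILE - 1, row * TILE + TILE - 1)
            cycles += 1                      # one test per (prim, tile) pair
            if overlaps(bbox, tile):
                bins[t].append(p)
    return bins, cycles

PRIMS = [(8, 8, 24, 24), (4, 2, 28, 10), (20, 20, 28, 28)]   # hypothetical bboxes
BINS, CYCLES = build_bins(PRIMS)
```

Because the primitive loop is the outer loop, each bin comes out in draw order without any sorting.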

## Next (staged, per Codex)
1. Multiple tiles / tile grid (primitive→tile binning).  [DONE: Ch304 grid, Ch305 list, Ch311 bin buffer]
2. Scissor/window clipping.  [DONE: Ch306]
3. Texture clamp/repeat.  [DONE: Ch307]
4. PSMCT16 tile color.  [DONE: Ch308]
5. ALPHA mode expansion.  [DONE: Ch309]
6. Bilinear filtering.  [DONE: Ch310, sampler + combined-path integration]
7. Larger grid sweep.  [DONE: Ch312, 2x2→4x4 (16 tiles, 64x64) via the bin buffer; no new RTL logic, COLS/ROWS/NTILES already parameterized]

## Ch312 — 4x4 grid (extension)

Scales the tiled renderer to a 4×4 grid (16 tiles, 16×16 each = 64×64) by setting
TILE_COLS=TILE_ROWS=4 — NO new RTL logic, since the grid loop + bin buffer (NTILES,
CUR_T_W/BIN_T_W via $clog2) were already parameterized. 64×64 PSMCT32 FB fills 16 KiB so
the demo uses VRAM 32 KiB (textures @ 0x4000).

Proven (tb_top_psmct32_tile_bin4x4_demo):
3 prims (P0 4-tile / P1 6-tile cross-seam / P2 1-tile, + 6 empty tiles), all 16 bin_n
read back exactly (1100 1211 0111 0001, empty=0), t5={0,1} order preserved, image 3240/3240
vs the re-test reference, seam continuity across x=16/32/48 + y=16/32. Regression 217→218,
byte-identical. The fit (owner) gives the resource-scaling answer: bin storage grows 4×
(60→240 register bits, still tiny) — a hard ALM/register jump would signal the register-bins
should go BRAM/MLAB-backed before larger scenes.

## Ch313 — full PSMCT16 framebuffer mode (extension)

Relaxes the `close_combined` eligibility gate so the combined/tiled path accepts a
PSMCT16 dest (`frame_1_q[29:24]==6'h02`) — but ONLY when `TILE_COLOR_PSMCT16=1`, so a
PSMCT16 FRAME never pairs with a PSMCT32 flush. This was the LAST place forcing a
PSMCT32 FRAME: the tile color RAM, the dest-color unpack for blending, and the flush
emit (be=`4'b0011`, psm=`0x02`, `<<1` byte addr) were ALL already PSMCT16 from Ch308,
keyed off `TILE_COLOR_PSMCT16` and independent of `FRAME.PSM`. So Ch308's PSMCT32-FRAME
workaround is gone — render/flush/scanout are now consistently RGB5A1. One-term RTL
change; at `TILE_COLOR_PSMCT16=0` (default) the new disjunct is constant-0 and the gate
collapses to the original PSMCT32-only test (byte-identical). Demo = the Ch312 4×4
(64×64) scene with `FRAME.PSM=PSMCT16` + DISPFB PSMCT16. A 64×64 PSMCT16 FB is 8 KiB —
HALF the 16 KiB PSMCT32 FB — so the demo runs in **16 KiB VRAM vs Ch312's 32 KiB**: the
direct framebuffer-memory saving that motivates the LPDDR-backed FB phase.

Proven
(tb_top_psmct32_tile_psmct16fb_demo): flush 4096/4096 carry psm=0x02 + be=`0011`, ZERO
PSMCT32 flushes (whole FB is 16-bit); image 3240/3240 vs a re-test reference replayed
with per-step RGB5A1 quantization `q5(c)=(c&0xF8)|(c>>5)` (EXACT); 2875 matched pixels
differ from the would-be PSMCT32 value (proves the FB is genuinely RGB5A1, not PSMCT32);
bin_n/scissor/depth identical to Ch312 (1100 1211 0111 0001, t5={0,1}, seam 464).

FIT-CLEARED + VISUALLY VERIFIED on Agilex 5 (2026-06-01): vs Ch312 (4×4 PSMCT32, 32 KiB),
RAM blocks 45→29 (−16), block-mem 688,128→421,888 (−266,240 bits, ≈260 Kibit), ALMs −159, regs −555: the
PSMCT16 FB recovered ALL of Ch312's 4×-scale-up memory cost, landing back on the Ch311 2×2
PSMCT32 baseline of 29 RAM blocks. A 4× grid in PSMCT16 costs ZERO extra framebuffer memory
vs the 2×2 PSMCT32 grid: hard proof the framebuffer (not bins/logic) is the on-chip memory
consumer and pixel format trades directly against it. Board image matches (blue/red/teal
tris + green, RGB5A1-quantized, no seams).
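
The framebuffer sizes quoted in Ch312 and Ch313 follow directly from the pixel format. A trivial sketch (`fb_bytes` is an illustrative helper):

```python
def fb_bytes(width, height, bytes_per_pixel):
    """Linear framebuffer footprint in bytes."""
    return width * height * bytes_per_pixel

KIB = 1024
FB_PSMCT32 = fb_bytes(64, 64, 4)   # 4x4 grid of 16x16 tiles, 32-bit pixels
FB_PSMCT16 = fb_bytes(64, 64, 2)   # same grid, RGB5A1
```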

## Ch314 — bilinear for palettized (PSMT8/PSMT4) textures (extension)

Extends bilinear to INDEXED textures with the CLUT-BEFORE-INTERPOLATE rule: each of the 4
taps fetches an index, CLUTs it to RGBA, then the 4 COLORS interpolate (NOT the indices —
that would round to one palette entry). The sampler core is ~6 lines: the bilinear FSM tap
capture changes from `tap[beat] <= tex_rd_data` to `tap[beat] <= near_color` (already
`(PSMT8||PSMT4)?clut_rd_data:tex_rd_data`), so PSMCT32 is byte-identical and indexed taps
capture the CLUT'd color. New param PALETTE_BILINEAR (default 0) widens `do_lin` to admit
PSMT8(0x13)/PSMT4(0x14). The per-tap addr-gen (linear/swizzle + wrap/clamp) already runs
BEFORE the CLUT lookup, so "swizzle-before-CLUT" + edge wrap/clamp are free. For the BOARD
demo (bilinear lives only in the combined path), `close_combined`'s texture-PSM gate also
widens to admit PSMT8/PSMT4 when PALETTE_BILINEAR; the shared gs_texture_unit already had
the CLUT port wired and CLUT is a combinational 3rd port (no read2-arbitration change).

Proven: tb_gs_texture_bilinear (unit) CASE7 PSMT8 red↔blue halfway → 0xFF7F007F (purple,
neither endpoint = colors interpolated), CASE10 PSMT4 nibble across a byte boundary, CASE11
repeat / CASE12 clamp edges + no OOB, CASE1-6 PSMCT32 byte-identical;
tb_top_psmct32_tile_palbilinear_demo (board, combined path + CLUT load) nearMid=0 /
bilMid=58 — the on-board CLUT-before-interp proof. Regression 219→220 byte-identical, board
elab EXIT 0.

FIT-CLEARED + VISUALLY VERIFIED on Agilex 5 (2026-06-01): LEFT tri blocky
blue/white indexed checker, RIGHT tri smoothed blue↔white midtones (CLUT-before-interp on
silicon). RESOURCE DELTA vs Ch310 (2×2 PSMCT32 bilinear, 16KB): ALMs 30,229→30,101 (flat),
DSP 122→122 (0), block-mem 425,984→425,984 (0), RAM blocks 29→29 (0) — palettized bilinear
is essentially FREE: zero extra DSP (reuses the same lerp multipliers), zero extra memory
(reuses the CLUT port from Ch296), the CLUT-before-interp restructure is just a mux on the
tap-capture path.
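
Why the order matters, in a two-tap sketch. The three-entry palette is hypothetical (red and blue with an unrelated green between them) and the truncating 4-bit lerp is an assumption; colors are ABGR8888 as in the text.

```python
RED, GREEN, BLUE = 0xFF0000FF, 0xFF00FF00, 0xFFFF0000   # ABGR8888
CLUT = [RED, GREEN, BLUE]                               # hypothetical palette

def lerp_abgr(c0, c1, f4):
    """Per-channel integer lerp of two ABGR8888 colors, 4-bit fraction."""
    out = 0
    for shift in (0, 8, 16, 24):
        a = (c0 >> shift) & 0xFF
        b = (c1 >> shift) & 0xFF
        out |= ((a * (16 - f4) + b * f4) >> 4) << shift
    return out

def clut_then_lerp(i0, i1, f4):
    """Correct order: look up each tap's color, then interpolate the colors."""
    return lerp_abgr(CLUT[i0], CLUT[i1], f4)

def lerp_then_clut(i0, i1, f4):
    """Wrong order: interpolating indices lands on an unrelated palette entry."""
    return CLUT[(i0 * (16 - f4) + i1 * f4) >> 4]
```

Halfway between the red and blue entries, the correct order yields the purple 0xFF7F007F from CASE7, while interpolating the indices would return whatever color happens to sit at the middle index.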

## Ch315 — primitive/bin capacity scaling (extension)

Parameterizes the primitive FIFO depth (was a hardcoded `FIFO_DEPTH=4`) as `TILE_FIFO_DEPTH`
(default 4 → byte-identical; power-of-2). In the bin-buffer renderer this depth sizes BOTH
the prim-list capacity N AND the per-tile bin depth M (bins are `[NTILES][FIFO_DEPTH]`),
so they're coupled (M=N: a tile's bin can hold every queued prim). Adds sim-visible
diagnostics (`raster_overflow_count_r`, `bin_occ_max_r`, defensive `bin_overflow_r`).
ARCHITECTURAL ANSWER to "where do register bins stop being reasonable": the dominant cost
is the ~40 `fifo_*` per-prim attribute arrays (hundreds of register bits/slot); the bins
add only `NTILES*FIFO_CNT_W` index bits per depth (~48 bits/depth at 4×4 — negligible), so
register bins stay cheap far past the FIFO's practical limit. OVERFLOW nuance: the batched
tile path triggers at `TILE_PRIM_COUNT` and drains the FIFO, so excess prims are CLAMPED
(visible as capped bin occupancy), not push-dropped — `raster_overflow` (the streaming
push-while-full flag, now counted) doesn't fire in the batched path; and `TILE_PRIM_COUNT`
must be `<= FIFO_DEPTH`.

Proven: tb_top_psmct32_tile_cap_demo (depth 8, 7 prims): bin t0
holds 6 (occ_max=6 > old 4), draw order {0..5}, image 3873/3873, no overflow;
tb_top_psmct32_tile_cap_overflow_demo (depth 4, same payload) — occ_max CLAMPS to 4
(capacity ceiling) and still renders all 16 tiles gracefully. Regression 220→222
byte-identical, board elab EXIT 0. (Demo puts the deep bin in t0 to dodge an orthogonal
latent bug: empty tiles preceding the first non-empty tile flush black — to be fixed
separately.)

FIT-CLEARED + VISUALLY VERIFIED on Agilex 5 (2026-06-01). RESOURCE SLOPE
(depth-8 vs Ch312 depth-4): ALMs 29,682→32,072 (+2,390), regs 33,356→37,486 (+4,130),
block-mem + RAM-blocks UNCHANGED (688,128 / 45). So +4 FIFO slots = ~1,033 regs + ~600 ALMs
PER primitive slot, ZERO block RAM — dominated by the ~40 fifo_* attribute arrays; the bins
add ~80 regs/slot (negligible). The bins never stop being reasonable; the per-prim attribute
FIFO is the ALM-bound scaling wall (~16-prim headroom at this grid). Beyond that, move the
per-prim attribute storage (not the bins) to block RAM.
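
The slope figures can be rechecked from the fit numbers above. This sketch assumes `FIFO_CNT_W` is `ceil(log2(depth))`, which reproduces the ~48 bits/depth quoted for 16 tiles at depth 8; the helper names are illustrative.

```python
def per_slot(before, after, slots_added):
    """Average resource growth per added FIFO slot."""
    return (after - before) / slots_added

def bin_index_bits_per_depth(ntiles, fifo_depth):
    """NTILES * FIFO_CNT_W, with FIFO_CNT_W assumed to be ceil(log2(depth))."""
    fifo_cnt_w = (fifo_depth - 1).bit_length()
    return ntiles * fifo_cnt_w

REGS_PER_SLOT = per_slot(33_356, 37_486, 4)   # depth 4 -> depth 8
ALMS_PER_SLOT = per_slot(29_682, 32_072, 4)
BIN_BITS = bin_index_bits_per_depth(16, 8)
```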

## Ch316 — leading-empty-tile traversal fix (correctness)

Fixes the latent bug found in Ch315: tiles that are EMPTY and PRECEDE the first non-empty
tile flushed BLACK instead of the clear colour. ROOT CAUSE: the per-tile flush row-stride
is `flush_pixel_index_w = flush_y*(ras_fbw<<6)+flush_x` (gs_stub ~line 3408), and `ras_fbw`
(FRAME width) was loaded ONLY by `mp_load_prim` (on primitive load). A leading-empty tile
loads no prim, so it used the reset `ras_fbw=0` → stride 0 → every row collapsed onto row
0's FB addresses → the tile's real screen rows kept the FB-init value (black). Empties AFTER
a render inherited that render's ras_fbw, hence were fine: exactly the asymmetry observed.

FIX: in the `mp_grid_start` branch (~line 5588) load `ras_fbp/ras_fbw/ras_psm/ras_bpp_shift`
from the batch's oldest FIFO entry (`fifo_*[fifo_rptr]`) at GRID-RENDER START, so the flush
address is valid for EVERY tile. A batch shares one FRAME, so this equals what `mp_load_prim`
sets at render → byte-identical for any batch whose first tile is non-empty.

Proven:
tb_top_psmct32_tile_late_demo (1 prim only in t15, t0..t14 empty) — ZERO black pixels, all
empty tiles green-cleared, bin_n[15]=1 (renderer reached the last tile, no premature done),
image 3990/3990; Ch315 cap_demo still 3873/3873. Regression 222→223 byte-identical, board
elab EXIT 0 (GS_TILE_LATE_DEMO). Root-caused with a direct VRAM probe (FB at empty tiles
0x0 → 0xFF008000 after the fix).

FIT-CLEARED + VISUALLY VERIFIED on Agilex 5 (2026-06-01):
whole 64×64 green (all 15 leading empties clear correctly) + one blue triangle in t15;
resources on the Ch312 baseline (ALMs 29,801, regs 32,442, block-mem/RAM unchanged) — the
fix adds zero storage (loads 4 existing ras_* regs at grid start), pure control-flow.
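
The row collapse is visible directly in the index formula. A sketch using the `flush_y*(ras_fbw<<6)+flush_x` expression from the text (`tile_flush_indices` is an illustrative helper; `ras_fbw=1` stands in for any valid FRAME width):

```python
def flush_pixel_index(flush_y, flush_x, ras_fbw):
    """Per-tile flush pixel index as quoted in the text."""
    return flush_y * (ras_fbw << 6) + flush_x

def tile_flush_indices(ras_fbw):
    """All 256 flush indices of one 16x16 tile at the origin."""
    return [flush_pixel_index(y, x, ras_fbw)
            for y in range(16) for x in range(16)]

collapsed = set(tile_flush_indices(0))   # reset value: stride 0
correct = set(tile_flush_indices(1))     # loaded FRAME width
```

With the reset value only 16 distinct pixels are ever written (all of them on row 0), so the other 15 rows keep the FB-init value.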

## Ch317 — LPDDR-backed framebuffer, tile-flush only (sim write/readback proof)

First external-framebuffer step, deliberately tight: ONLY the PSMCT16 tile FLUSH is
redirected to an LPDDR framebuffer; tile color/Z + texture stay on-chip. The proven LPDDR
path (doc 0008, 8.4 GB/s, 256-bit AXI4 → EMIF hard-IP) lives in a SEPARATE diagnostic core,
not the GS top — so this rung proves the write/readback path against a behavioral LPDDR
MODEL (no board fit; wiring the real EMIF master + LPDDR scanout into the GS top is the next
rung). New module `gs_lpddr_fb_writer.sv`: a staging FIFO + burst engine (coalesces a
contiguous +2 run into one burst, 4 KiB cap per the doc 0008 AXI lesson) + byte-addressed
backing FB + bandwidth/over-underflow counters. Consumes the existing flush stream
(`raster_pixel_fb_addr_q` is already the linear `fb_base+(y*pitch+x)*2`). Integrated into
the bram top generate-guarded by `LPDDR_FB_ENABLE` (default 0 → not instantiated,
byte-identical), as a transitional ADDITIVE mirror (BRAM FB still feeds scanout; LPDDR is
the readback-proof target).

Proven: tb_gs_lpddr_fb_writer (256-px tile → 512 B / 16 bursts;
2049-px run → 2 bursts via the 4 KiB cap; enable=0 inert) and tb_top_psmct32_tile_lpddrfb_demo
(Ch313 PSMCT16 scene → LPDDR FB == BRAM FB for all 4096 px; 8192 bytes; 256× 32-B bursts; no
over/underflow; ~0.20 GB/s @100 MHz model). Regression 223→225 byte-identical, board elab
EXIT 0 (writer pruned at default). FIX worth noting: the cast `PTR_W'(FIFO_DEPTH)` truncates
the depth to 0, so the full compare fired on an EMPTY FIFO; test the extra MSB `count[PTR_W]` instead.
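
The burst counts above follow from a simple coalescing rule. This is a behavioral sketch, not the writer's RTL: it merges byte addresses that advance by exactly one 2-byte pixel and treats the 4 KiB cap as a size limit per burst (whether the real engine also splits on 4 KiB address boundaries is not modeled). The 64-pixel pitch is an assumption.

```python
def coalesce(addrs, px_bytes=2, cap=4096):
    """Merge contiguous +px_bytes writes into [start, nbytes] bursts, capped in size."""
    bursts = []
    for a in addrs:
        last = bursts[-1] if bursts else None
        if last and a == last[0] + last[1] and last[1] + px_bytes <= cap:
            last[1] += px_bytes          # extends the current run
        else:
            bursts.append([a, px_bytes]) # gap or cap reached: start a new burst
    return bursts

PITCH = 64  # pixels per FB row (assumed for the sketch)

def tile_flush_addrs(tile_col, tile_row):
    """Byte addresses of one 16x16 PSMCT16 tile flush, row by row."""
    return [((tile_row * 16 + y) * PITCH + tile_col * 16 + x) * 2
            for y in range(16) for x in range(16)]

one_tile = coalesce(tile_flush_addrs(0, 0))
long_run = coalesce([i * 2 for i in range(2049)])
frame = [b for r in range(4) for c in range(4)
         for b in coalesce(tile_flush_addrs(c, r))]
```

Each tile row is one 32-byte run, so a tile is 16 bursts and the 4×4 frame is 256 bursts totalling 8192 bytes.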

## Ch318 — LPDDR framebuffer write path on hardware (RTL sim-proven + fit-ready; board gated)

Connects the Ch317 write path to the real fabric→LPDDR port. qsys_top exposes an
`f2sdram` AXI4-256 port (was tied off); the GS runs on design_clk, f2sdram on CLOCK2_50 —
genuinely async. New `gs_async_fifo` (gray-code CDC) + `gs_lpddr_axi_master` (thin wrapper,
per Codex — does NOT touch the proven writer): GS-domain packer (16 px → one 256-bit
tile-row beat {addr,data,strb}) → async FIFO → f2sdram AXI burst FSM (single-beat INCR,
AWSIZE=5, AWLEN=0, full WSTRB). A HARD `write_enable` gate (packer + awvalid/wvalid + FSM)
makes an LPDDR write impossible unless explicitly enabled — Linux-safety. de25 top exposes
the PSMCT16 flush stream and, under `ifdef GS_LPDDR_FB`, drives the f2sdram write channel
(default = legacy inert tie-off → byte-identical; with the macro, write_enable=0 + FB_BASE=0
placeholder, so the fitted core boots inert).

Proven: tb_gs_lpddr_axi_master (gate-off →
zero AXI activity; gate-on → 16 INCR beats, 0 protocol/bresp/FIFO errors, slave readback ==
source, under AW/W/B backpressure + async clocks); de25 elaborates EXIT 0 both ways;
regression byte-identical. fifo_full gotcha: use `count[PTR_W]` (a PTR_W-wide literal
truncates DEPTH→0).

The BOARD run is GATED on a Linux-safe LPDDR address (owner: /proc/iomem
→ reserved region → FB_BASE → raise write_enable → write 8 KiB → HPS devmem readback/hash).
HW acceptance = write/readback + fitter snapshot. Ch319 = LPDDR scanout.
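
The `fifo_full` gotcha is easy to reproduce outside a simulator. In this sketch `size_cast` models a SystemVerilog size cast that keeps only the low bits; the depth of 16 and the function names are illustrative, and `count` is assumed to be `PTR_W+1` bits wide.

```python
FIFO_DEPTH = 16
PTR_W = 4                       # log2(FIFO_DEPTH); count is PTR_W+1 bits wide

def size_cast(value, width):
    """Model of a width-limited cast: keep only the low `width` bits."""
    return value & ((1 << width) - 1)

def full_buggy(count):
    """Compares against DEPTH truncated to PTR_W bits, which is 0."""
    return count == size_cast(FIFO_DEPTH, PTR_W)

def full_fixed(count):
    """Tests the extra MSB count[PTR_W], set only when count == DEPTH."""
    return bool((count >> PTR_W) & 1)
```

The buggy form reports full exactly when the FIFO is empty and never when it is actually full.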

## Ch319 — LPDDR4B framebuffer write + HPS-bridge readback (SILICON-VERIFIED)

The f2sdram/HPS-DRAM path of Ch318 was CLOSED as platform-rejected (BRESP 256/256 on the
secure reserved region — /dev/mem reads of 0x80000000 crash the board). The GS framebuffer
pivots to **FPGA-private LPDDR4B** via the EMIF_Qsys IP (cloned from de25_lpddr4_bw/ao486,
same device): emif_clk ~310 MHz, emif_reset_n = cal-ready. Reuse the Ch318 writer chain
(`gs_lpddr_axi_master` + `gs_async_fifo` + counters), just retargeted onto the EMIF AXI
write port instead of f2sdram. New `gs_lpddr_rd_probe.sv` lets the HPS read any FB word back
over the bridge (`LPDDR_RDADDR` @ 0x03C: write byte-addr → poll `LPDDR_STATUS[3]` rd_pending
→ read the 32-bit word); the `lpddr_dump` tool walks this to pull a whole frame to a PPM.
**Crucially the FB is FPGA-private, NOT Linux RAM** — so verification is via the bridge probe
+ bridge COUNTERS (bytes/bursts/bresp_err), never /dev/mem.

SILICON-VERIFIED: write 8 KiB →
bridge readback hash matches the source (md5 3b12baffc00bb6419fa66272c75b2cc7), BRESP_ERRS=0.
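
The probe handshake can be sketched against a mock bridge. Only `LPDDR_RDADDR` @ 0x03C and the `rd_pending` bit `STATUS[3]` come from the text; the `LPDDR_STATUS` and `LPDDR_RDDATA` offsets, the `MockBridge` class and its latency are HYPOTHETICAL stand-ins, and this is not the `lpddr_dump` tool's code.

```python
LPDDR_RDADDR = 0x03C        # from the text
LPDDR_STATUS = 0x040        # HYPOTHETICAL offset, for illustration only
LPDDR_RDDATA = 0x044        # HYPOTHETICAL offset, for illustration only
RD_PENDING = 1 << 3         # STATUS[3], from the text

class MockBridge:
    """Stand-in for the HPS bridge: a read stays pending for `latency` polls."""
    def __init__(self, fb, latency=3):
        self.fb, self.latency = fb, latency
        self.addr, self.busy = 0, 0

    def write32(self, offset, value):
        if offset == LPDDR_RDADDR:
            self.addr, self.busy = value, self.latency

    def read32(self, offset):
        if offset == LPDDR_STATUS:
            pending = self.busy > 0
            self.busy = max(0, self.busy - 1)
            return RD_PENDING if pending else 0
        if offset == LPDDR_RDDATA:
            return int.from_bytes(self.fb[self.addr:self.addr + 4], "little")
        return 0

def probe_read(bridge, byte_addr, max_polls=1000):
    """Write the byte address, poll rd_pending, then read the 32-bit word."""
    bridge.write32(LPDDR_RDADDR, byte_addr)
    for _ in range(max_polls):
        if not bridge.read32(LPDDR_STATUS) & RD_PENDING:
            return bridge.read32(LPDDR_RDDATA)
    raise TimeoutError("rd_pending never cleared")

def dump(bridge, nbytes):
    """Walk the probe over the first nbytes of the framebuffer."""
    return b"".join(probe_read(bridge, a).to_bytes(4, "little")
                    for a in range(0, nbytes, 4))
```

Hashing the bytes returned by such a walk and comparing against the source frame is the shape of the readback check described above.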

## Ch320 — LPDDR4B scanout to HDMI (SILICON-VERIFIED)

Display the LPDDR4B framebuffer on HDMI. `gs_lpddr_scanout.sv`: a whole-frame cache (an 8 KiB
M20K copy of the frame, NOT an ao486-style line buffer) filled from LPDDR via single-beat
reads (arlen=0 — the ONLY AXI read pattern proven on this EMIF; multi-beat bursts garble),
indexed by the PCRTC `vram_read_addr`. `gs_lpddr_rd_arb.sv`: a 2:1 read arbiter sharing the
EMIF read channel (port0 scanout = priority, port1 Ch319 probe). de25 top muxes the video
source (BRAM default / LPDDR scanout) on `LPDDR_CTRL[2]` video_src, gated by the PCRTC
display-window (`pix_window_o`). SILICON-VERIFIED at 64×64: scanout pixel-identical to BRAM.
**Bug found+fixed on silicon:** the scanout ignored the PCRTC display window → 10 sheared
tiles; fixed by exposing `pix_window_o` and gating the scanout mux. The whole-frame cache
DOES NOT SCALE — see Ch321: at 1024 beats (32 KiB) it never finishes loading on this EMIF.

## Ch321 — larger FB (128×128 PSMCT16) + LINE-BUFFER scanout (SILICON-VERIFIED) — ACCEPTED ARCHITECTURE

Two bricks. **Brick 1 (render):** new 128×128 PSMCT16 fixture (32 KiB frame) +
`GS_TILE_LPDDR128_DEMO` profile (VRAM grown 8→64 KiB so a 32 KiB frame fits, TILE grid 8×8 of
16×16 tiles). **Brick 2 (scanout) — the real deliverable:** `gs_lpddr_scanout_lb.sv`, a
double-buffered LINE-BUFFER reader that holds just TWO scanlines (displays row L from buf[L&1]
while prefetching the next row), O(width) on-chip not O(width×height). **DECISION: the
whole-frame cache is REFERENCE/FALLBACK only, NOT the architecture** — a cache that "fits"
still MIRRORS the FB in M20K, defeating the move to LPDDR, and empirically it won't even load
a 32 KiB frame on this EMIF (frame-cache `0x4` → cache_valid never sets, blank). The
line-buffer is THE scanout path going forward.

SILICON-VERIFIED: render BURSTS=0x400/BRESP=0;
line-buffer `LPDDR_CTRL=0xC` → STATUS line_valid=1/rd_errs=0, HDMI matches the lpddr_dump PPM
pixel-for-pixel (no col-1 band — the sim TB's residual 1px/line was confirmed a checker
leading-edge artifact, not hardware). **Three real HARDWARE bugs fixed** (the first board
attempt garbled): (1) multi-beat burst → single-beat reads (arlen=0); (2) miss-prone request
toggle → free-running sequential prefetcher; (3) vsync-mid-read AXI abort → deferred reset
(`fs_pending`, never abort an in-flight read). Fit clean: 31,683 ALMs (68%), 117 RAM (33%).
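
The double-buffer indexing can be modeled in software. This sketch captures only the `buf[L&1]` ping-pong and the storage cost; AXI timing, the free-running prefetcher and the deferred reset are not modeled, and the function names are illustrative.

```python
def scanout(fb_rows):
    """Display row L from buf[L & 1] while the other buffer receives row L+1."""
    buf = [None, None]
    buf[0] = list(fb_rows[0])                 # row 0 is fetched before display starts
    shown = []
    for line in range(len(fb_rows)):
        nxt = line + 1
        if nxt < len(fb_rows):
            buf[nxt & 1] = list(fb_rows[nxt]) # prefetch into the idle buffer
        shown.append(list(buf[line & 1]))     # display the current row
    return shown

def line_buffer_bytes(width, bytes_per_pixel):
    """On-chip storage for two scanlines."""
    return 2 * width * bytes_per_pixel

FRAME = [[(y << 8) | x for x in range(128)] for y in range(128)]
```

For the 128×128 PSMCT16 frame this is 512 bytes of line storage against 32 KiB for a whole-frame cache, which is the O(width) versus O(width×height) point made above.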

## Next (per Codex)
The framebuffer now lives off-chip (write + line-buffer scanout, silicon-proven). Make
TEXTURE storage external next, correctness-first, before any performance sizing:
1. **Ch322 — LPDDR-backed texture fetch/cache (correctness-first).** One known texture in
   LPDDR4B; a small read-only texture cache behind the sampler; BRAM texture path stays as
   fallback. Acceptance: LPDDR-textured image == BRAM-textured image, rd_errs=0, counters
   prove LPDDR fills happened. NOTE (prereq-check finding): the nearest-path sampler assumes
   FIXED 1-cycle texel latency (no stall on the default path — CB_TWAIT only exists for
   bilinear/combined), so a naive demand-miss stall would corrupt output. Resolve via
   prefetch-warm (fill the cache fully before raster → every read a 1-cycle hit) OR add a
   sampler/walker stall — see the Ch322 framing.
2. Framebuffer/Z backing to LPDDR with tile flush/reload.
3. Command-stream ingestion (defer until both FB and texture memory are off-chip).
Only after a real-trace texture-format histogram (doc 0008 §4) is performance LPDDR sizing
honest; Ch322 is correctness-only and does NOT pretend to know real-game cache sizing.