# 0010 — On-chip tile-local renderer core (first tiled-VRAM rung)

**Status:** proven in sim (Ch303), board-pending. One 16×16 tile, on-chip color+Z, flush to VRAM. Texture still BRAM-VRAM. NO LPDDR, NO multi-tile binning yet.

## Why

doc 0009 established the per-pixel requirement for a combined textured+alpha+depth primitive: **hidden = 1 read / 0 writes; visible = 3 reads / 2 writes.** doc 0008 §6 sets the target architecture: on-chip tile color+Z (RMW resolved on-chip), LPDDR as backing VRAM, texture streamed/cached. This rung builds the **first piece**: the on-chip tile color+Z scratchpad with flush, so the combined RMW happens on-chip and only the texture fetch + the tile flush cross to VRAM. (Codex framing: build the tile-local core first; stage LPDDR integration later.)

## What was built

- **`gs_tile_ram`** (rtl/gif_gs/gs_tile_ram.sv): generic 1W1R on-chip tile RAM, registered read (1-cycle, matching the VRAM read2 contract). Instantiated twice in gs_stub (gated by `TILE_LOCAL`): `u_tile_color` (256×32) + `u_tile_z` (256×32) — one 16×16 tile, ~2 KiB total.
- **gs_stub `TILE_LOCAL` mode** (default 0 → byte-identical): a combined TME+ABE+ZTE triangle renders into the tile via a CLEAR → RENDER → FLUSH sequence overlaid on the existing R_IDLE/R_SCAN/R_DRAIN FSM.

## The tile memory schedule (the deliverable)

```
CLEAR  : 256 cycles → write tile_color = TILE_CLEAR_COLOR, tile_z = TILE_CLEAR_Z
         (every entry initialized; the "background")
RENDER (per inside pixel, tile index = {y[3:0], x[3:0]}):
  beat0  read tile_z
  beat1  Z-test (GEQUAL: frag_z vs tile_z).
           FAIL → STOP (no texture read, no tile_color read, no tile_color/tile_z write).
           PASS → read texture (VRAM)
  beat2  texel ready (Cs/As) ; read tile_color (dest)
  beat3  blend ; WRITE tile_color
  beat4  WRITE tile_z (skip on ZMSK)
FLUSH  : 256 cycles → read tile_color[idx] (registered) → framebuffer write
         (raster_pixel_emit → VRAM at the linear FB address). ~70 MB/s class per doc 0008 — trivial.
```

In tile terms:

- **hidden pixel:** tile_z read only. No texture, no tile_color read/write, no tile_z write.
- **visible pixel:** tile_z read + texture read + tile_color read + tile_color write + tile_z write.
- **flush:** tile_color read → framebuffer write, ×256.

Texture stays on the VRAM read2 path (unchanged). Only color/Z moved on-chip.

## Verification (tb_top_psmct32_tile_demo)

Combined triangle (interpolated Z crossing the clear Z) over a CLEAR'd green tile. A tracer on the tile-RAM ports + the emit port asserts the schedule:

- CLEAR wrote 256 color + 256 Z entries.
- hidden (depth-fail) pixels: no texture read, no tile_color write, no tile_z write.
- visible (depth-pass) pixels: texture read + tile_color write + tile_z write; rendered color = blend(texel, clear-green); occluded/outside = clear green.
- FLUSH emitted 256 framebuffer writes; final scanout matches the Ch302 image.

Result: clear 256/256, flush 256, 35 visible / 7 hidden, errors=0. `TILE_LOCAL=0` keeps every prior demo byte-identical.

## External LPDDR bandwidth model (documented, not yet exercised)

Per doc 0008: framebuffer flush is **trivial** (640×480×4 B ≈ 1.2 MB/frame ≈ 70 MB/s @60fps, noise vs the measured 8.4 GB/s). Texture fetch + tile-reload from primitive disorder are the real DDR consumers and are **locality-driven** — to be measured against real GS traces (doc 0008 §4–5), not synthesized. This rung does NOT touch LPDDR: texture is BRAM-VRAM, one tile, no eviction.

## Ch304 — 2×2 multi-tile grid (extension)

The single-tile core generalizes to a `TILE_COLS×TILE_ROWS` grid with minimal change, because (a) `tile_idx = {y[3:0],x[3:0]}` is already the tile-local address for any 16-aligned tile, and (b) attribute interpolation is screen-space → seams are continuous by construction.
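
As a small illustration of point (a), the sketch below models the tile-local address in Python. `tile_local_index` follows the `{y[3:0],x[3:0]}` concatenation from the text; `tile_of` and its row-major tile numbering are hypothetical helpers added for illustration and are not taken from the RTL.

```python
TILE = 16  # tile edge in pixels (16x16 tile, as in the text)


def tile_local_index(x: int, y: int) -> int:
    """Tile-local address {y[3:0], x[3:0]}: low 4 bits of y above low 4 bits of x."""
    return ((y & 0xF) << 4) | (x & 0xF)


def tile_of(x: int, y: int, tile_cols: int) -> int:
    """Hypothetical row-major tile number (assumption, not from the RTL)."""
    return (y // TILE) * tile_cols + (x // TILE)


if __name__ == "__main__":
    # Screen pixel (17, 18) in a 2x2 grid: tile number and tile-local address.
    print(tile_of(17, 18, 2), hex(tile_local_index(17, 18)))
```

Because only the low four bits of each coordinate are used, any screen pixel maps to the same 0..255 range regardless of which 16-aligned tile it falls in, which is why the single-tile RAM addressing needs no change for a grid.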
Added: an outer tile loop (the popped primitive + solved gradients persist across all tiles), per-tile walker-bbox clip to `primitive_bbox ∩ tile` (skip render if no overlap → tile shows clear color), and a flush FB-address offset by the tile origin. `TILE_COLS=TILE_ROWS=1` is byte-identical to the single-tile path. Codex scope: fixed primitive list (one primitive re-rendered per tile), re-test-against-each-tile, NO external bin memory.

Proven (tb_top_psmct32_tile2x2_demo): one triangle spanning a 2×2 grid (32×32, crossing x=16 & y=16) — all 4 tiles clear independently (256 each), 1024 flush emits, and the **whole 32×32 scanout matches a single screen-space reference** (718/718, 67 seam-region pixels continuous) → no visible seams. This is the re-test-each-primitive-against-each-tile architecture; a real binning engine / command buffer is a later optimization, not needed for the architectural proof.

## Ch305 — MULTI-PRIMITIVE tiled scene (extension)

Generalizes the grid from re-rendering ONE primitive per tile to compositing a LIST of primitives per tile, in order, so later primitives depth-test/alpha-blend over earlier ones within each tile. Gated by `TILE_MULTIPRIM` (default 0 → byte-identical) + `TILE_PRIM_COUNT` (batch size). The primitive FIFO IS the list store (its slots already hold each primitive's pre-solved gradients), so re-reading a slot is free.

Per tile: CLEAR → load+render prim 0 → (pipeline-flush) → load+render prim 1 → … → FLUSH. The grid starts only once the whole batch is buffered (`fifo_count >= TILE_PRIM_COUNT && all_grad_done`) — the demo-honest stand-in for a future GIF-EOP/kick. Empty-clip primitives are skipped per tile; the whole FIFO is drained at grid end (streaming/partial-drain is future work). The inter-primitive advance waits for the per-pixel pipeline to fully flush (`comb_pipe_empty`), not just the walker reaching R_DRAIN, so a primitive's in-flight color/Z writes commit before the next primitive's `ras_*` load.
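
The per-tile ordering above can be sketched as a small software model. This is a minimal sketch under loud assumptions: primitives are axis-aligned rectangles with constant Z (stand-ins for interpolated triangles), the clear Z is 0, and `render_tile`, `source_over` and `SCENE` are hypothetical names. The blend uses the source-over form `((Cs - Cd) * As >> 7) + Cd` with `As = 0x80` meaning 1.0.

```python
def source_over(cs, cd, alpha):
    """Per-channel source-over: ((Cs - Cd) * As >> 7) + Cd, clamped to 0..255."""
    v = (((cs - cd) * alpha) >> 7) + cd
    return max(0, min(255, v))


def render_tile(prims, clear_rgb=(0, 0x80, 0), clear_z=0, size=16):
    """CLEAR, then composite each primitive in draw order; returns the tile colors."""
    color = [[clear_rgb] * size for _ in range(size)]   # CLEAR color
    depth = [[clear_z] * size for _ in range(size)]     # CLEAR Z
    for prim in prims:                                  # draw order
        x0, y0, x1, y1 = prim["rect"]
        for y in range(max(y0, 0), min(y1, size - 1) + 1):
            for x in range(max(x0, 0), min(x1, size - 1) + 1):
                if prim["z"] < depth[y][x]:             # GEQUAL fails: hidden pixel
                    continue
                dest = color[y][x]
                color[y][x] = tuple(
                    source_over(s, d, prim["alpha"])
                    for s, d in zip(prim["rgb"], dest)
                )
                depth[y][x] = prim["z"]
    return color                                        # FLUSH would emit these


# Hypothetical scene: opaque blue, opaque red in front, translucent white between.
SCENE = [
    {"rect": (0, 0, 15, 15), "z": 0x5000, "rgb": (0, 0, 255), "alpha": 0x80},
    {"rect": (4, 4, 7, 7), "z": 0x6000, "rgb": (255, 0, 0), "alpha": 0x80},
    {"rect": (6, 6, 9, 9), "z": 0x5800, "rgb": (255, 255, 255), "alpha": 0x40},
]

if __name__ == "__main__":
    tile = render_tile(SCENE)
    print(tile[0][0], tile[5][5], tile[6][6], tile[9][9])
```

In this model the later white primitive blends over blue but leaves red untouched where red's Z is greater, which is the ordering property the pipeline-flush between primitives is there to protect.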
Proven (tb_top_psmct32_tile_multiprim_demo): 3 combined prims over the 2×2 grid — opaque blue bg (Z=0x5000), opaque red (Z=0x6000, in front), translucent white (Z=0x5800, blends but is OCCLUDED by red where 0x5800 < 0x6000). The whole 32×32 scanout matches a software integer-Z-buffer + source-over replay (514/514), with all interaction regions exercised (blue 24 / red 48 / light-blue 26 / occlusion 19 / green 416) and seam continuity (50 seam matches). This is the architecture a real command-stream replay needs; a per-tile bin buffer is a later optimization.

DEBUG NOTES (two non-obvious bugs surfaced):

(1) The per-primitive clip wires indexed a FIFO array through a function inside continuous `assign`s — iverilog-12 mis-reads that as 0 (silent hang, sim-time frozen); fixed by computing them in `always_comb` (legal SV, Quartus-clean; sim-only workaround).

(2) The first failing image was a FIXTURE bug, not RTL: three solid 4×4 textures placed 0x100 apart aliased, because a PSMCT32 texture with TBW=1 has a 0x100-byte row stride so a 4-tall texture spans 0x400 bytes; spacing them 0x400 apart (TBP0 32/36/40) fixed it. The depth/RMW path was correct all along.

## Ch306 — GS SCISSOR clipping (extension)

Bakes the GS SCISSOR_1 rectangle into the tile-traversal walker bounds (param `SCISSOR_ENABLE`, default 0 → byte-identical). Because the scissor is a rectangle and the walker scans a rectangle, the effective draw region = primitive bbox ∩ tile bbox ∩ scissor rect is itself rectangular — so the scissor is just intersected into the walker bbox at all clip sites (single-prim `clip_*`, multiprim `always_comb`, and the `mp_next_nonempty` empty-test). NO per-pixel scissor test: pixels outside the scissor are never visited, so color and Z writes are both suppressed for free. SCISSOR_1 (GIF reg 0x40) is parsed into a GLOBAL `scissor_1_q` (reset full-range); decoded fields → 12-bit `eff_sc*` gated by SCISSOR_ENABLE (0/0xFFF when off → max/min no-op).
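
The rectangle argument above can be illustrated with a short sketch. It assumes inclusive `(x0, y0, x1, y1)` rectangles; `intersect` and `walker_bounds` are hypothetical helper names, and the default scissor of `0..0xFFF` stands in for the disabled (no-op) case described in the text.

```python
def intersect(a, b):
    """Intersect two inclusive rects (x0, y0, x1, y1); None when they do not overlap."""
    x0, y0 = max(a[0], b[0]), max(a[1], b[1])
    x1, y1 = min(a[2], b[2]), min(a[3], b[3])
    return (x0, y0, x1, y1) if x0 <= x1 and y0 <= y1 else None


def walker_bounds(prim_bbox, tile_bbox, scissor=(0, 0, 0xFFF, 0xFFF)):
    """Effective draw region: primitive bbox, then tile bbox, then scissor rect."""
    region = intersect(prim_bbox, tile_bbox)
    return intersect(region, scissor) if region else None


if __name__ == "__main__":
    scissor = (9, 6, 22, 20)  # the [9..22] x [6..20] rectangle used in the demo
    print(walker_bounds((0, 0, 31, 31), (16, 0, 31, 15), scissor))
```

Since the result is always a rectangle or empty, the walker only needs new min/max bounds, and an empty result doubles as the "skip this primitive for this tile" test.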
Per-primitive (FIFO-snapshot) scissor is a future extension if a command stream varies it.

Proven (tb_top_psmct32_tile_scissor_demo): the Ch305 3-prim scene + SCISSOR_1 [9..22]×[6..20] (crossing both seams) — 514/514 match, clipped=39 (would-be-scene pixels outside the rect are clear green), inside=59 (kept scene matches the unclipped ref), exact boundary (edgePairs=6), seam=50. Regression 209→210, byte-identical.

## Ch307 — texture WRAP modes (REPEAT + CLAMP) (extension)

Adds GS texture wrap (CLAMP_1 WMS/WMT: REPEAT/CLAMP) for u/v, inside `gs_texture_unit` (param `TEX_WRAP_ENABLE`, default 0 → pass-through byte-identical). Applied to u/v BEFORE texel-address gen, so it covers the linear and swizzle paths and all callers at one point. REPEAT = `u & (2^TW - 1)`; CLAMP = `min(u, 2^TW-1)`. gs_stub parses CLAMP_1 (reg 0x48) and snapshots wrap mode + TW/TH per primitive (FIFO, like ras_tbw), so REPEAT and CLAMP primitives coexist in one scene. Codex sequencing: wrap/clamp before bilinear, because it determines which edge neighbours a future bilinear filter samples.

Proven: a standalone sampler TB (tb_gs_texture_wrap) covers PSMCT32 + PSMT8 + PSMT4-swizzle (wrap happens before swizzle); the board TB (tb_top_psmct32_tile_wrap_demo, 557/557) renders two textured tris sampling a striped 4×4 texture with UV 0..8 — REPEAT tiles 2× (two white stripes), CLAMP sticks (one white stripe + edge-stretched blue). Regression 210→212, byte-identical. (Fixture lesson: the first NON-solid tile texture exposed an upload-giftag REGS nibble-count bug that solid textures had masked.)

## Ch308 — PSMCT16 tile color buffer (extension)

The on-chip tile COLOR RAM can be PSMCT16 (RGB5A1, 16-bit) instead of PSMCT32, via param `TILE_COLOR_PSMCT16` (default 0 → byte-identical). It HALVES the color tile RAM (`TILE_COLOR_W` = 16; Z RAM stays 32-bit) — the first answer to "can tile color be narrower than the 32-bit blend width when the frame format allows it?" (yes).
The RMW packs ABGR8888→pix16 on write/clear, unpacks pix16→ABGR (bit-replicate) for the blend dest, and the FLUSH emits PSMCT16 framebuffer writes (mirroring the proven S2 PSMCT16 emit; vram_normalize keys the halfword off byte_addr[1]). Scanout reads PSMCT16 via DISPFB1.PSM=0x02.

CONSTRAINT discovered: the combined tile path's primitive eligibility requires FRAME.PSM==PSMCT32 (the combined RMW was built PSMCT32-only). So the PSMCT16-ness lives in the tile RAM + flush + DISPFB, NOT in FRAME.PSM — the demo keeps FRAME PSMCT32 (so the prims classify as combined) while DISPFB + tile + flush are PSMCT16. A fully-PSMCT16 FRAME would need the combined gate relaxed to accept PSMCT16 dest (future work).

Proven (tb_top_psmct32_tile_psmct16_demo, 514/514): the Ch305 scene in PSMCT16, matched against a software reference that applies the SAME per-step 5-bit quantization (q(c)=(c&0xF8)|(c>>5)) the on-chip RMW does — each primitive blends over the quantized dest. Clear green 0x80→0x84, light-blue 0x7F→0x7B, pure blue/red unchanged, red occlusion intact. Regression 212→213, byte-identical.

## Ch309 — generic GS ALPHA blend modes (extension)

Generalizes the combined blender from the single hardcoded source-over to the GS selector machinery `Cv = clamp(((A-B)*C)>>7 + D)` (A/B/C/D from ALPHA_1, FIX=[39:32]), param `ALPHA_MODES_ENABLE` (default 0 → source-over, byte-identical). gs_alpha_blend gains a_sel/b_sel/c_sel/d_sel/ad/fix inputs + a generic datapath; gs_stub FIFO-snapshots the per-primitive selectors+FIX and wires them to u_comb_blend. The combined-eligibility gate `close_combined` (which hardcoded source-over) is relaxed to accept any ABE primitive when ALPHA_MODES_ENABLE — the generic blender handles any config. (Same class of "eligibility gate too strict" as Ch308's PSMCT32-FRAME requirement: when you add a per-pixel mode to the combined path, check the datapath AND close_combined.)
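
A per-channel sketch of the selector formula may help make the modes concrete. It assumes the usual GS ALPHA selector encoding (A/B/D: 0=Cs, 1=Cd, 2=0; C: 0=As, 1=Ad, 2=FIX); `gs_blend` is a hypothetical software model, not the `gs_alpha_blend` RTL, and the reserved selector value 3 is not modeled.

```python
def gs_blend(cs, cd, a_s, a_d, fix, a_sel, b_sel, c_sel, d_sel):
    """One 8-bit channel of Cv = clamp(((A - B) * C >> 7) + D)."""
    colors = (cs, cd, 0)          # A/B/D selector: 0=Cs, 1=Cd, 2=0 (assumed encoding)
    coeffs = (a_s, a_d, fix)      # C selector: 0=As, 1=Ad, 2=FIX (assumed encoding)
    v = (((colors[a_sel] - colors[b_sel]) * coeffs[c_sel]) >> 7) + colors[d_sel]
    return max(0, min(255, v))    # clamp to 0..255


SOURCE_OVER = (0, 1, 0, 1)  # A=Cs, B=Cd, C=As, D=Cd
ADDITIVE = (0, 2, 2, 1)     # A=Cs, B=0, C=FIX, D=Cd (FIX=0x80 gives Cs + Cd)

if __name__ == "__main__":
    # Red source added onto a blue destination, channel by channel (R, G, B).
    red, blue = (255, 0, 0), (0, 0, 255)
    print(tuple(gs_blend(s, d, 0x80, 0x80, 0x80, *ADDITIVE) for s, d in zip(red, blue)))
```

With `FIX=0x80` the multiply-and-shift is an identity, so the additive configuration reduces to a clamped `Cs + Cd`, while the source-over tuple reproduces the previously hardcoded blend.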
Proven (tb_top_psmct32_tile_alpha_demo, 514/514): the Ch305 scene with P1 ADDITIVE (A=Cs,B=0,C=FIX=0x80,D=Cd → Cs+Cd) → magenta over the blue bg (glow/particle add), while P0/P2 stay source-over (light-blue intact) and P2 is still depth-occluded by P1. Two blend modes coexist. Regression 213→214, byte-identical.

## Ch310 — bilinear texture filtering (extension, 2-phase)

4-tap bilinear (PSMCT32), staged per Codex.

PHASE 1: a multi-beat bilinear sampler in gs_texture_unit (param BILINEAR_ENABLE, default 0): reads the 4 neighbours (each via the Ch307 wrap), lerps by fractional U/V; schedule = 4·(1+RD_LATENCY)+1 ≈ 9 cyc/sample (the architectural number for the future texture cache). Proven by a standalone TB (tb_gs_texture_bilinear) — all 6 cases exact (center=nearest, halfway=4-tap avg, clamp edge no-OOB, repeat edge wraps, nearest unchanged).

PHASE 2: integrated into the COMBINED tile path — TEX1.MMAG (GIF 0x14 bit5) per-primitive selects nearest vs linear; a runtime `filter_lin` input gates the 4-tap; the affine interp gains a frac sibling (interp_affine_uv_frac → step[15:12]); a new CB_TWAIT beat stalls the per-pixel FSM on the LEVEL !tex_busy until the ~9-cycle sample completes (the FSM steps half-rate on z_advance, so a level wait can't miss the 1-cycle out_valid), then CB_T latches the HELD filtered texel. Depth/Z/blend/tile-RMW unchanged; bilinear did NOT touch close_combined (the prim is still source-over ABE).

Proven (tb_top_psmct32_tile_bilinear_demo): a magnified 4×4 blue/white checker, nearest tri blocky (0 midtones) vs bilinear tri smoothed (all midtones), same coverage (stall dropped nothing). Regression 215→216, byte-identical.

## Ch311 — per-tile BIN BUFFER (extension)

Replaces Ch305's render-time re-test (mp_next_nonempty: each tile re-scans all prims) with a real precomputed bin buffer (param BIN_BUFFER_ENABLE, default 0).
A new TP_BIN phase runs a (prim,tile) double-loop counter FSM (prim_count×NTILES cycles) that tests each prim's bbox∩tile∩scissor (the same overlap math) and appends the prim index to bin_prim[tile][] / bin_n[tile], in ascending draw order. The render then walks each tile's bin (CLEAR-done loads bin slot 0; RENDER-drain steps through bin_n; FLUSH at end) — no re-scan. Equivalent image to the re-test path (same overlap test + order). This is the primitive-ROUTING machinery for command-stream replay; the grid stays 2×2 (prove the mechanism, scale later).

Proven (tb_top_psmct32_tile_bin_demo): bins read back exactly (t0={0,1} t1={0,1} t2={0} t3={0,2} for an all-tiles/2-tiles/1-tile prim trio) + image 594/594 vs the re-test reference. Regression 216→217, byte-identical.

## Next (staged, per Codex)

1. Multiple tiles / tile grid (primitive→tile binning). [DONE: Ch304 grid, Ch305 list, Ch311 bin buffer]
2. Scissor/window clipping. [DONE: Ch306]
3. Texture clamp/repeat. [DONE: Ch307]
4. PSMCT16 tile color. [DONE: Ch308]
5. ALPHA mode expansion. [DONE: Ch309]
6. Bilinear filtering. [DONE: Ch310 — sampler + combined-path integration]
7. Larger grid sweep. [DONE: Ch312 — 2x2→4x4 (16 tiles, 64x64) via the bin buffer; no new RTL logic, COLS/ROWS/NTILES already parameterized]

## Ch312 — 4x4 grid (extension)

Scales the tiled renderer to a 4×4 grid (16 tiles, 16×16 each = 64×64) by setting TILE_COLS=TILE_ROWS=4 — NO new RTL logic, since the grid loop + bin buffer (NTILES, CUR_T_W/BIN_T_W via $clog2) were already parameterized. 64×64 PSMCT32 FB fills 16 KiB so the demo uses VRAM 32 KiB (textures @ 0x4000).

Proven (tb_top_psmct32_tile_bin4x4_demo): 3 prims (P0 4-tile / P1 6-tile cross-seam / P2 1-tile, + 6 empty tiles), all 16 bin_n read back exactly (1100 1211 0111 0001, empty=0), t5={0,1} order preserved, image 3240/3240 vs the re-test reference, seam continuity across x=16/32/48 + y=16/32. Regression 217→218, byte-identical.
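
The (prim, tile) double loop and its grid-size independence can be sketched in software as follows. This is an illustrative model only: `build_bins` is a hypothetical name, tiles are assumed to be numbered row-major, the scissor term is omitted, and the example bounding boxes are invented to reproduce an all-tiles / 2-tiles / 1-tile trio on a 2×2 grid.

```python
def build_bins(prim_bboxes, cols, rows, tile=16):
    """Append each primitive index to every tile its inclusive bbox overlaps."""
    bins = [[] for _ in range(cols * rows)]
    for p, (x0, y0, x1, y1) in enumerate(prim_bboxes):   # ascending draw order
        for t in range(cols * rows):                     # one overlap test per (prim, tile)
            tx0 = (t % cols) * tile                      # assumed row-major tile origin
            ty0 = (t // cols) * tile
            if x0 <= tx0 + tile - 1 and x1 >= tx0 and y0 <= ty0 + tile - 1 and y1 >= ty0:
                bins[t].append(p)
    return bins


if __name__ == "__main__":
    # Hypothetical bboxes: P0 covers everything, P1 the top two tiles, P2 the last tile.
    prims = [(0, 0, 31, 31), (4, 4, 27, 12), (20, 20, 28, 28)]
    print(build_bins(prims, cols=2, rows=2))
```

Only `cols` and `rows` change when going from 2×2 to 4×4, which mirrors the point that the grid scale-up needed no new logic once the loop bounds were parameterized.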
The fit (owner) gives the resource-scaling answer: bin storage grows 4× (60→240 register bits, still tiny) — a hard ALM/register jump would signal the register-bins should go BRAM/MLAB-backed before larger scenes.

## Ch313 — full PSMCT16 framebuffer mode (extension)

Relaxes the `close_combined` eligibility gate so the combined/tiled path accepts a PSMCT16 dest (`frame_1_q[29:24]==6'h02`) — but ONLY when `TILE_COLOR_PSMCT16=1`, so a PSMCT16 FRAME never pairs with a PSMCT32 flush. This was the LAST place forcing a PSMCT32 FRAME: the tile color RAM, the dest-color unpack for blending, and the flush emit (be=`4'b0011`, psm=`0x02`, `<<1` byte addr) were ALL already PSMCT16 from Ch308, keyed off `TILE_COLOR_PSMCT16` and independent of `FRAME.PSM`. So Ch308's PSMCT32-FRAME workaround is gone — render/flush/scanout are now consistently RGB5A1. One-term RTL change; at `TILE_COLOR_PSMCT16=0` (default) the new disjunct is constant-0 and the gate collapses to the original PSMCT32-only test (byte-identical).

Demo = the Ch312 4×4 (64×64) scene with `FRAME.PSM=PSMCT16` + DISPFB PSMCT16. A 64×64 PSMCT16 FB is 8 KiB — HALF the 16 KiB PSMCT32 FB — so the demo runs in **16 KiB VRAM vs Ch312's 32 KiB**: the direct framebuffer-memory saving that motivates the LPDDR-backed FB phase.

Proven (tb_top_psmct32_tile_psmct16fb_demo): flush 4096/4096 carry psm=0x02 + be=`0011`, ZERO PSMCT32 flushes (whole FB is 16-bit); image 3240/3240 vs a re-test reference replayed with per-step RGB5A1 quantization `q5(c)=(c&0xF8)|(c>>5)` (EXACT); 2875 matched pixels differ from the would-be PSMCT32 value (proves the FB is genuinely RGB5A1, not PSMCT32); bin_n/scissor/depth identical to Ch312 (1100 1211 0111 0001, t5={0,1}, seam 464).
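
The quantization and framebuffer-size figures quoted above are easy to check with a few lines of Python. `q5` is the formula from the text; `fb_bytes` is a hypothetical helper for the size arithmetic.

```python
def q5(c: int) -> int:
    """5-bit quantize an 8-bit channel, then bit-replicate back to 8 bits."""
    return (c & 0xF8) | (c >> 5)


def fb_bytes(width: int, height: int, bytes_per_pixel: int) -> int:
    """Linear framebuffer size in bytes."""
    return width * height * bytes_per_pixel


if __name__ == "__main__":
    print(hex(q5(0x80)), hex(q5(0x7F)))            # clear green and light-blue channels
    print(fb_bytes(64, 64, 2), fb_bytes(64, 64, 4))  # PSMCT16 vs PSMCT32 at 64x64
```

Applying `q5` twice gives the same value as applying it once, so re-quantizing an already RGB5A1-sourced channel at each blend step does not drift; only the blend result itself is re-quantized.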
FIT-CLEARED + VISUALLY VERIFIED on Agilex 5 (2026-06-01): vs Ch312 (4×4 PSMCT32, 32 KiB), RAM blocks 45→29 (−16), block-mem 688,128→421,888 (−266,240 bits, about −260 Kbit), ALMs −159, regs −555 — the PSMCT16 FB recovered ALL of Ch312's 4×-scale-up memory cost, landing back on the Ch311 2×2 PSMCT32 baseline of 29 RAM blocks. A 4× grid in PSMCT16 costs ZERO extra framebuffer memory vs the 2×2 PSMCT32 grid: hard proof the framebuffer (not bins/logic) is the on-chip memory consumer and pixel format trades directly against it. Board image matches (blue/red/teal tris + green, RGB5A1-quantized, no seams).

## Ch314 — bilinear for palettized (PSMT8/PSMT4) textures (extension)

Extends bilinear to INDEXED textures with the CLUT-BEFORE-INTERPOLATE rule: each of the 4 taps fetches an index, CLUTs it to RGBA, then the 4 COLORS interpolate (NOT the indices — that would round to one palette entry). The sampler core is ~6 lines: the bilinear FSM tap capture changes from `tap[beat] <= tex_rd_data` to `tap[beat] <= near_color` (already `(PSMT8||PSMT4)?clut_rd_data:tex_rd_data`), so PSMCT32 is byte-identical and indexed taps capture the CLUT'd color. New param PALETTE_BILINEAR (default 0) widens `do_lin` to admit PSMT8(0x13)/PSMT4(0x14). The per-tap addr-gen (linear/swizzle + wrap/clamp) already runs BEFORE the CLUT lookup, so "swizzle-before-CLUT" + edge wrap/clamp are free.

For the BOARD demo (bilinear lives only in the combined path), `close_combined`'s texture-PSM gate also widens to admit PSMT8/PSMT4 when PALETTE_BILINEAR; the shared gs_texture_unit already had the CLUT port wired and CLUT is a combinational 3rd port (no read2-arbitration change).
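
The CLUT-before-interpolate rule can be sketched as below. The 4-bit fraction and the truncating lerp `a + (((b - a) * f) >> 4)` are assumptions made for this illustration (the RTL's exact rounding is not stated here), and `lerp8`, `lerp_abgr` and `bilinear_indexed` are hypothetical names. Colors are ABGR8888 words.

```python
def lerp8(a: int, b: int, f4: int) -> int:
    """Lerp one 8-bit channel with a 4-bit fraction (0..15); truncating (assumed)."""
    return a + (((b - a) * f4) >> 4)


def lerp_abgr(c0: int, c1: int, f4: int) -> int:
    """Lerp two ABGR8888 words channel by channel."""
    out = 0
    for shift in (0, 8, 16, 24):
        ch = lerp8((c0 >> shift) & 0xFF, (c1 >> shift) & 0xFF, f4)
        out |= (ch & 0xFF) << shift
    return out


def bilinear_indexed(idx4, clut, fu: int, fv: int) -> int:
    """idx4 = (i00, i10, i01, i11): look up all four taps, then blend the COLORS."""
    c00, c10, c01, c11 = (clut[i] for i in idx4)   # CLUT first, per tap
    top = lerp_abgr(c00, c10, fu)
    bottom = lerp_abgr(c01, c11, fu)
    return lerp_abgr(top, bottom, fv)


if __name__ == "__main__":
    clut = [0xFF0000FF, 0xFFFF0000]                # entry 0 = red, entry 1 = blue (ABGR)
    print(hex(bilinear_indexed((0, 1, 0, 1), clut, fu=8, fv=0)))
```

Under these assumptions a halfway sample between a red and a blue entry yields a purple that is neither palette entry, which is the observable difference from interpolating the indices themselves.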
Proven: tb_gs_texture_bilinear (unit) CASE7 PSMT8 red↔blue halfway → 0xFF7F007F (purple, neither endpoint = colors interpolated), CASE10 PSMT4 nibble across a byte boundary, CASE11 repeat / CASE12 clamp edges + no OOB, CASE1-6 PSMCT32 byte-identical; tb_top_psmct32_tile_palbilinear_demo (board, combined path + CLUT load) nearMid=0 / bilMid=58 — the on-board CLUT-before-interp proof. Regression 219→220 byte-identical, board elab EXIT 0.

FIT-CLEARED + VISUALLY VERIFIED on Agilex 5 (2026-06-01): LEFT tri blocky blue/white indexed checker, RIGHT tri smoothed blue↔white midtones (CLUT-before-interp on silicon). RESOURCE DELTA vs Ch310 (2×2 PSMCT32 bilinear, 16KB): ALMs 30,229→30,101 (flat), DSP 122→122 (0), block-mem 425,984→425,984 (0), RAM blocks 29→29 (0) — palettized bilinear is essentially FREE: zero extra DSP (reuses the same lerp multipliers), zero extra memory (reuses the CLUT port from Ch296), the CLUT-before-interp restructure is just a mux on the tap-capture path.

## Ch315 — primitive/bin capacity scaling (extension)

Parameterizes the primitive FIFO depth (was a hardcoded `FIFO_DEPTH=4`) as `TILE_FIFO_DEPTH` (default 4 → byte-identical; power-of-2). In the bin-buffer renderer this depth sizes BOTH the prim-list capacity N AND the per-tile bin depth M (bins are `[NTILES][FIFO_DEPTH]`), so they're coupled (M=N: a tile's bin can hold every queued prim). Adds sim-visible diagnostics (`raster_overflow_count_r`, `bin_occ_max_r`, defensive `bin_overflow_r`).

ARCHITECTURAL ANSWER to "where do register bins stop being reasonable": the dominant cost is the ~40 `fifo_*` per-prim attribute arrays (hundreds of register bits/slot); the bins add only `NTILES*FIFO_CNT_W` index bits per depth (~48 bits/depth at 4×4 — negligible), so register bins stay cheap far past the FIFO's practical limit.
OVERFLOW nuance: the batched tile path triggers at `TILE_PRIM_COUNT` and drains the FIFO, so excess prims are CLAMPED (visible as capped bin occupancy), not push-dropped — `raster_overflow` (the streaming push-while-full flag, now counted) doesn't fire in the batched path; and `TILE_PRIM_COUNT` must be `<= FIFO_DEPTH`.

Proven: tb_top_psmct32_tile_cap_demo (depth 8, 7 prims) — bin t0 holds 6 (occ_max=6 > old 4), draw order {0..5}, image 3873/3873, no overflow; tb_top_psmct32_tile_cap_overflow_demo (depth 4, same payload) — occ_max CLAMPS to 4 (capacity ceiling) and still renders all 16 tiles gracefully. Regression 220→222 byte-identical, board elab EXIT 0. (Demo puts the deep bin in t0 to dodge an orthogonal latent bug: empty tiles preceding the first non-empty tile flush black — to be fixed separately.)

FIT-CLEARED + VISUALLY VERIFIED on Agilex 5 (2026-06-01). RESOURCE SLOPE (depth-8 vs Ch312 depth-4): ALMs 29,682→32,072 (+2,390), regs 33,356→37,486 (+4,130), block-mem + RAM-blocks UNCHANGED (688,128 / 45). So +4 FIFO slots = ~1,033 regs + ~600 ALMs PER primitive slot, ZERO block RAM — dominated by the ~40 fifo_* attribute arrays; the bins add ~80 regs/slot (negligible). The bins never stop being reasonable; the per-prim attribute FIFO is the ALM-bound scaling wall (~16-prim headroom at this grid). Beyond that, move the per-prim attribute storage (not the bins) to block RAM.

## Ch316 — leading-empty-tile traversal fix (correctness)

Fixes the latent bug found in Ch315: tiles that are EMPTY and PRECEDE the first non-empty tile flushed BLACK instead of the clear colour.

ROOT CAUSE: the per-tile flush row-stride is `flush_pixel_index_w = flush_y*(ras_fbw<<6)+flush_x` (gs_stub ~line 3408), and `ras_fbw` (FRAME width) was loaded ONLY by `mp_load_prim` (on primitive load). A leading-empty tile loads no prim, so it used the reset `ras_fbw=0` → stride 0 → every row collapsed onto row 0's FB addresses → the tile's real screen rows kept the FB-init value (black).
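
The stride collapse in the root cause can be reproduced with the flush index formula alone. A minimal sketch, assuming tile-origin offsets are ignored; `flush_pixel_index` mirrors the expression quoted above and `tile_indices` is a hypothetical helper.

```python
def flush_pixel_index(flush_x: int, flush_y: int, ras_fbw: int) -> int:
    """flush_y * (ras_fbw << 6) + flush_x: FBW counts units of 64 pixels."""
    return flush_y * (ras_fbw << 6) + flush_x


def tile_indices(ras_fbw: int, size: int = 16):
    """Set of distinct framebuffer pixel indices written by one tile flush."""
    return {flush_pixel_index(x, y, ras_fbw) for y in range(size) for x in range(size)}


if __name__ == "__main__":
    print(len(tile_indices(0)), len(tile_indices(1)))  # distinct targets: reset FBW vs FBW=1
```

With the reset value `ras_fbw=0` all 256 flush writes land on just the 16 addresses of row 0, so rows 1..15 of the tile are never written, which matches the black rows seen on leading-empty tiles.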
Empties AFTER a render inherited that render's ras_fbw, hence were fine — the exact asymmetry observed.

FIX: in the `mp_grid_start` branch (~line 5588) load `ras_fbp/ras_fbw/ras_psm/ras_bpp_shift` from the batch's oldest FIFO entry (`fifo_*[fifo_rptr]`) at GRID-RENDER START, so the flush address is valid for EVERY tile. A batch shares one FRAME, so this equals what `mp_load_prim` sets at render → byte-identical for any batch whose first tile is non-empty.

Proven: tb_top_psmct32_tile_late_demo (1 prim only in t15, t0..t14 empty) — ZERO black pixels, all empty tiles green-cleared, bin_n[15]=1 (renderer reached the last tile, no premature done), image 3990/3990; Ch315 cap_demo still 3873/3873. Regression 222→223 byte-identical, board elab EXIT 0 (GS_TILE_LATE_DEMO). Root-caused with a direct VRAM probe (FB at empty tiles 0x0 → 0xFF008000 after the fix).

FIT-CLEARED + VISUALLY VERIFIED on Agilex 5 (2026-06-01): whole 64×64 green (all 15 leading empties clear correctly) + one blue triangle in t15; resources on the Ch312 baseline (ALMs 29,801, regs 32,442, block-mem/RAM unchanged) — the fix adds zero storage (loads 4 existing ras_* regs at grid start), pure control-flow.

## Ch317 — LPDDR-backed framebuffer, tile-flush only (sim write/readback proof)

First external-framebuffer step, deliberately tight: ONLY the PSMCT16 tile FLUSH is redirected to an LPDDR framebuffer; tile color/Z + texture stay on-chip. The proven LPDDR path (doc 0008, 8.4 GB/s, 256-bit AXI4 → EMIF hard-IP) lives in a SEPARATE diagnostic core, not the GS top — so this rung proves the write/readback path against a behavioral LPDDR MODEL (no board fit; wiring the real EMIF master + LPDDR scanout into the GS top is the next rung).

New module `gs_lpddr_fb_writer.sv`: a staging FIFO + burst engine (coalesces a contiguous +2 run into one burst, 4 KiB cap per the doc 0008 AXI lesson) + byte-addressed backing FB + bandwidth/over-underflow counters.
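
The coalescing rule (extend a burst while each address is exactly the previous end, up to a 4 KiB cap) can be sketched in Python. This models only the grouping decision, not the staging FIFO or AXI signalling; `coalesce` is a hypothetical name, and the cap is treated as a simple length limit.

```python
def coalesce(addrs, px_bytes: int = 2, cap: int = 4096):
    """Group byte addresses into [start, length] bursts of contiguous +px_bytes runs."""
    bursts = []
    for a in addrs:
        last = bursts[-1] if bursts else None
        if last and a == last[0] + last[1] and last[1] + px_bytes <= cap:
            last[1] += px_bytes            # contiguous and under the cap: extend
        else:
            bursts.append([a, px_bytes])   # gap or cap reached: start a new burst
    return bursts


if __name__ == "__main__":
    pitch_px = 64                           # assumed framebuffer pitch in pixels
    tile = [(y * pitch_px + x) * 2 for y in range(16) for x in range(16)]
    print(len(coalesce(tile)), sum(n for _, n in coalesce(tile)))
```

For a 16×16 PSMCT16 tile inside a wider framebuffer, each tile row is one contiguous 32-byte run and the row pitch breaks the run, so one burst per tile row falls out naturally.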
Consumes the existing flush stream (`raster_pixel_fb_addr_q` is already the linear `fb_base+(y*pitch+x)*2`). Integrated into the bram top generate-guarded by `LPDDR_FB_ENABLE` (default 0 → not instantiated, byte-identical), as a transitional ADDITIVE mirror (BRAM FB still feeds scanout; LPDDR is the readback-proof target).

Proven: tb_gs_lpddr_fb_writer (256-px tile → 512 B / 16 bursts; 2049-px run → 2 bursts via the 4 KiB cap; enable=0 inert) and tb_top_psmct32_tile_lpddrfb_demo (Ch313 PSMCT16 scene → LPDDR FB == BRAM FB for all 4096 px; 8192 bytes; 256× 32-B bursts; no over/underflow; ~0.20 GB/s @100 MHz model). Regression 223→225 byte-identical, board elab EXIT 0 (writer pruned at default). FIX worth noting: a `PTR_W'(FIFO_DEPTH)` truncation read the FIFO empty-as-full; use `count[PTR_W]`.

## Ch318 — LPDDR framebuffer write path on hardware (RTL sim-proven + fit-ready; board gated)

Connects the Ch317 write path to the real fabric→LPDDR port. qsys_top exposes an `f2sdram` AXI4-256 port (was tied off); the GS runs on design_clk, f2sdram on CLOCK2_50 — genuinely async. New `gs_async_fifo` (gray-code CDC) + `gs_lpddr_axi_master` (thin wrapper, per Codex — does NOT touch the proven writer): GS-domain packer (16 px → one 256-bit tile-row beat {addr,data,strb}) → async FIFO → f2sdram AXI burst FSM (single-beat INCR, AWSIZE=5, AWLEN=0, full WSTRB). A HARD `write_enable` gate (packer + awvalid/wvalid + FSM) makes an LPDDR write impossible unless explicitly enabled — Linux-safety. de25 top exposes the PSMCT16 flush stream and, under `ifdef GS_LPDDR_FB`, drives the f2sdram write channel (default = legacy inert tie-off → byte-identical; with the macro, write_enable=0 + FB_BASE=0 placeholder, so the fitted core boots inert).

Proven: tb_gs_lpddr_axi_master (gate-off → zero AXI activity; gate-on → 16 INCR beats, 0 protocol/bresp/FIFO errors, slave readback == source, under AW/W/B backpressure + async clocks); de25 elaborates EXIT 0 both ways; regression byte-identical.
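
The `PTR_W'(FIFO_DEPTH)` truncation noted for the Ch317 writer is easy to see in a bit-level model. This sketch assumes a power-of-2 depth and an occupancy counter that is `PTR_W+1` bits wide (0..DEPTH); `full_buggy` and `full_fixed` are hypothetical names.

```python
PTR_W = 4
DEPTH = 1 << PTR_W                 # 16 entries; count ranges 0..16 and needs PTR_W+1 bits


def full_buggy(count: int) -> bool:
    """Compare against DEPTH cast to PTR_W bits: 16 truncates to 0."""
    truncated_depth = DEPTH & ((1 << PTR_W) - 1)
    return count == truncated_depth


def full_fixed(count: int) -> bool:
    """Use the counter's top bit, count[PTR_W], which is set only at count == DEPTH."""
    return bool((count >> PTR_W) & 1)


if __name__ == "__main__":
    print(full_buggy(0), full_buggy(DEPTH), full_fixed(0), full_fixed(DEPTH))
```

The buggy comparison reports an empty FIFO as full and a genuinely full FIFO as not full, whereas testing the extra counter bit needs no literal of a particular width at all.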
fifo_full gotcha: use `count[PTR_W]` (a PTR_W-wide literal truncates DEPTH→0). The BOARD run is GATED on a Linux-safe LPDDR address (owner: /proc/iomem → reserved region → FB_BASE → raise write_enable → write 8 KiB → HPS devmem readback/hash). HW acceptance = write/readback + fitter snapshot. Ch319 = LPDDR scanout.

## Ch319 — LPDDR4B framebuffer write + HPS-bridge readback (SILICON-VERIFIED)

The f2sdram/HPS-DRAM path of Ch318 was CLOSED as platform-rejected (BRESP 256/256 on the secure reserved region — /dev/mem reads of 0x80000000 crash the board). The GS framebuffer pivots to **FPGA-private LPDDR4B** via the EMIF_Qsys IP (cloned from de25_lpddr4_bw/ao486, same device): emif_clk ~310 MHz, emif_reset_n = cal-ready. Reuse the Ch318 writer chain (`gs_lpddr_axi_master` + `gs_async_fifo` + counters), just retargeted onto the EMIF AXI write port instead of f2sdram.

New `gs_lpddr_rd_probe.sv` lets the HPS read any FB word back over the bridge (`LPDDR_RDADDR` @ 0x03C: write byte-addr → poll `LPDDR_STATUS[3]` rd_pending → read the 32-bit word); the `lpddr_dump` tool walks this to pull a whole frame to a PPM. **Crucially the FB is FPGA-private, NOT Linux RAM** — so verification is via the bridge probe + bridge COUNTERS (bytes/bursts/bresp_err), never /dev/mem.

SILICON-VERIFIED: write 8 KiB → bridge readback hash matches the source (md5 3b12baffc00bb6419fa66272c75b2cc7), BRESP_ERRS=0.

## Ch320 — LPDDR4B scanout to HDMI (SILICON-VERIFIED)

Display the LPDDR4B framebuffer on HDMI. `gs_lpddr_scanout.sv`: a whole-frame cache (an 8 KiB M20K copy of the frame, NOT an ao486-style line buffer) filled from LPDDR via single-beat reads (arlen=0 — the ONLY AXI read pattern proven on this EMIF; multi-beat bursts garble), indexed by the PCRTC `vram_read_addr`. `gs_lpddr_rd_arb.sv`: a 2:1 read arbiter sharing the EMIF read channel (port0 scanout = priority, port1 Ch319 probe).
de25 top muxes the video source (BRAM default / LPDDR scanout) on `LPDDR_CTRL[2]` video_src, gated by the PCRTC display-window (`pix_window_o`).

SILICON-VERIFIED at 64×64: scanout pixel-identical to BRAM. **Bug found+fixed on silicon:** the scanout ignored the PCRTC display window → 10 sheared tiles; fixed by exposing `pix_window_o` and gating the scanout mux. The whole-frame cache DOES NOT SCALE — see Ch321: at 1024 beats (32 KiB) it never finishes loading on this EMIF.

## Ch321 — larger FB (128×128 PSMCT16) + LINE-BUFFER scanout (SILICON-VERIFIED) — ACCEPTED ARCHITECTURE

Two bricks.

**Brick 1 (render):** new 128×128 PSMCT16 fixture (32 KiB frame) + `GS_TILE_LPDDR128_DEMO` profile (VRAM grown 8→64 KiB so a 32 KiB frame fits, TILE grid 8×8 of 16×16 tiles).

**Brick 2 (scanout) — the real deliverable:** `gs_lpddr_scanout_lb.sv`, a double-buffered LINE-BUFFER reader that holds just TWO scanlines (displays row L from buf[L&1] while prefetching the next row), O(width) on-chip not O(width×height).

**DECISION: the whole-frame cache is REFERENCE/FALLBACK only, NOT the architecture** — a cache that "fits" still MIRRORS the FB in M20K, defeating the move to LPDDR, and empirically it won't even load a 32 KiB frame on this EMIF (frame-cache `0x4` → cache_valid never sets, blank). The line-buffer is THE scanout path going forward.

SILICON-VERIFIED: render BURSTS=0x400/BRESP=0; line-buffer `LPDDR_CTRL=0xC` → STATUS line_valid=1/rd_errs=0, HDMI matches the lpddr_dump PPM pixel-for-pixel (no col-1 band — the sim TB's residual 1px/line was confirmed a checker leading-edge artifact, not hardware).

**Three real HARDWARE bugs fixed** (the first board attempt garbled): (1) multi-beat burst → single-beat reads (arlen=0); (2) miss-prone request toggle → free-running sequential prefetcher; (3) vsync-mid-read AXI abort → deferred reset (`fs_pending`, never abort an in-flight read).

Fit clean: 31,683 ALMs (68%), 117 RAM (33%).
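
The double-buffered line reader can be modeled in a few lines. This is a sequential software sketch of the `buf[L&1]` scheme only: in hardware the prefetch of row L+1 overlaps the display of row L, and none of the AXI, vsync or window handling is represented. `scan_frame` and the flat-list framebuffer are hypothetical stand-ins.

```python
def scan_frame(fb, width: int, height: int):
    """Scan out a frame holding at most two rows on chip: display buf[L & 1], prefetch L+1."""
    buf = [None, None]
    buf[0] = fb[0:width]                               # prime row 0 before display starts
    out = []
    for row in range(height):
        if row + 1 < height:                           # prefetch the next row into the other buffer
            nxt = row + 1
            buf[nxt & 1] = fb[nxt * width:(nxt + 1) * width]
        out.extend(buf[row & 1])                       # display the current row
    return out


if __name__ == "__main__":
    w = h = 128
    fb = list(range(w * h))                            # stand-in for the LPDDR framebuffer
    assert scan_frame(fb, w, h) == fb
    print(2 * w * 2, w * h * 2)                        # line-buffer bytes vs whole-frame bytes at 2 B/px
```

At 128×128 PSMCT16 the two line buffers hold 512 bytes against 32 KiB for a whole-frame copy, and the on-chip cost grows with width only, which is the O(width) property the decision above rests on.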
## Next (per Codex)

The framebuffer now lives off-chip (write + line-buffer scanout, silicon-proven). Make TEXTURE storage external next, correctness-first, before any performance sizing:

1. **Ch322 — LPDDR-backed texture fetch/cache (correctness-first).** One known texture in LPDDR4B; a small read-only texture cache behind the sampler; BRAM texture path stays as fallback. Acceptance: LPDDR-textured image == BRAM-textured image, rd_errs=0, counters prove LPDDR fills happened.

   NOTE (prereq-check finding): the nearest-path sampler assumes FIXED 1-cycle texel latency (no stall on the default path — CB_TWAIT only exists for bilinear/combined), so a naive demand-miss stall would corrupt output. Resolve via prefetch-warm (fill the cache fully before raster → every read a 1-cycle hit) OR add a sampler/walker stall — see the Ch322 framing.
2. Framebuffer/Z backing to LPDDR with tile flush/reload.
3. Command-stream ingestion (defer until both FB and texture memory are off-chip).

Only after a real-trace texture-format histogram (doc 0008 §4) is performance LPDDR sizing honest; Ch322 is correctness-only and does NOT pretend to know real-game cache sizing.