Perf: skip needless dedup re-cluster + interlock word-search grids

Two things found while chasing the recurring ~15min slowness:

- dedup.py: cluster_duplicates re-ran an O(n²) cosine pass over ALL ~3.7k
  articles and rewrote duplicate_of for every one of them on EVERY cycle, even
  when nothing new arrived (embedded=0). That cost ~53s of CPU plus a large WAL
  commit that starved live API reads (/api/brief took 2-7s). dedup() now skips
  the re-cluster entirely when nothing new was embedded, since the clusters
  can't have changed. Verified: the cycle drops from ~53s to ~1s and
  /api/brief stays at 20ms through a cycle, vs 2-7s before. (A real new
  article still triggers a full re-cluster.)

- games.py _build_grid: word placement took the first random valid spot, so
  words rarely crossed. It now gathers the valid placements and PREFERS ones
  that cross an already-placed word (a shared matching letter), falling back
  to any valid spot, so the grid interlocks like a real word search. Every
  word is still placed (tests green). NOTE: this changes today's grid layouts,
  so an in-progress word search resets once.
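
A minimal standalone sketch of the placement preference described in the
second bullet. It is not the games.py code itself: place_words, DIRS, and the
attempts parameter are illustrative names, and the eight-direction list is an
assumption because the real _DIRS is not shown in this commit.

```python
import random

# Assumption: eight compass directions; the real _DIRS is not shown in the commit.
DIRS = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]


def place_words(words, size, seed, attempts=400):
    """Place words on a size x size grid, preferring spots that cross earlier words."""
    rng = random.Random(seed)
    grid = [[None] * size for _ in range(size)]
    placed = {}  # word -> list of (row, col) cells it occupies
    for word in sorted(words, key=len, reverse=True):
        if len(word) > size:
            continue  # cannot fit in any direction
        valid = []  # (overlap_count, cells)
        for _ in range(attempts):
            dr, dc = rng.choice(DIRS)
            r0, c0 = rng.randrange(size), rng.randrange(size)
            cells = [(r0 + dr * i, c0 + dc * i) for i in range(len(word))]
            if any(not (0 <= r < size and 0 <= c < size) for r, c in cells):
                continue
            # Valid when every cell is empty or already holds the same letter.
            if all(grid[r][c] in (None, word[i]) for i, (r, c) in enumerate(cells)):
                overlap = sum(1 for i, (r, c) in enumerate(cells) if grid[r][c] == word[i])
                valid.append((overlap, cells))
        if not valid:
            continue
        # Prefer placements that share at least one letter with the grid so far.
        crossing = [v for v in valid if v[0] > 0]
        _, cells = rng.choice(crossing if crossing else valid)
        for i, (r, c) in enumerate(cells):
            grid[r][c] = word[i]
        placed[word] = cells
    return grid, placed


WORDS = ["PYTHON", "SEARCH", "GRID", "WORD"]
grid, placed = place_words(WORDS, size=10, seed=1)
```

Because a placement is only valid when overlapping cells already hold the same
letter, a later word never corrupts an earlier one, and the same seed always
reproduces the same layout.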

Also added a systemd drop-in (Nice=19/CPUWeight=20/IOWeight=10/ionice-idle) to
deprioritize the batch cycle. That is minor; the dedup skip is the real fix.
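
For reference, a drop-in with those settings might look roughly like the
fragment below. The file path and unit name are hypothetical (the commit does
not name them), and IOSchedulingClass=idle is assumed to be what "ionice-idle"
refers to.

```ini
# Hypothetical location: /etc/systemd/system/<batch-unit>.service.d/lowprio.conf
[Service]
Nice=19
CPUWeight=20
IOWeight=10
IOSchedulingClass=idle
```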

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
jay
2026-06-12 12:35:01 -04:00
parent 39d682f353
commit 2ef0efd909
2 changed files with 23 additions and 4 deletions
+10
@@ -166,6 +166,16 @@ def dedup(
     embed_limit: int | None = None,
 ) -> dict:
     embedded = ensure_embeddings(conn, client, limit=embed_limit)
+    if embedded == 0:
+        # Nothing new entered the corpus → the clusters and duplicate_of links are
+        # unchanged, so skip the full re-cluster. It was re-running an O(n²) cosine
+        # pass over EVERY article and rewriting duplicate_of for all ~3.7k of them
+        # every cycle (~53s + a large WAL commit), which starved live API reads
+        # (/api/brief 2-7s). Most cycles find no new articles, so this makes the
+        # cycle near-instant and keeps reads fast. A real new article re-runs it.
+        dups = conn.execute("SELECT COUNT(*) FROM articles WHERE duplicate_of IS NOT NULL").fetchone()[0]
+        return {"embedded": 0, "articles": 0, "clusters": 0, "duplicate_clusters": 0,
+                "duplicates": dups, "skipped": True}
     stats = cluster_duplicates(conn, threshold=threshold, window_days=window_days)
     stats["embedded"] = embedded
     return stats
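
The early return in this hunk can be exercised in isolation. The sketch below
is only an illustration under stated assumptions: it uses an in-memory SQLite
database (the commit mentions WAL but does not name the engine), and
dedup_cycle, embed_new, and recluster are hypothetical stand-ins for dedup,
ensure_embeddings, and cluster_duplicates.

```python
import sqlite3


def dedup_cycle(conn, embed_new, recluster):
    """Run one dedup cycle, skipping the re-cluster when nothing new was embedded.

    embed_new and recluster are hypothetical stand-ins for ensure_embeddings
    and cluster_duplicates; each takes the connection as its only argument.
    """
    embedded = embed_new(conn)
    if embedded == 0:
        # Read-only path: report the existing duplicate count, rewrite nothing.
        dups = conn.execute(
            "SELECT COUNT(*) FROM articles WHERE duplicate_of IS NOT NULL"
        ).fetchone()[0]
        return {"embedded": 0, "duplicates": dups, "skipped": True}
    stats = recluster(conn)
    stats["embedded"] = embedded
    return stats


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (id INTEGER PRIMARY KEY, duplicate_of INTEGER)")
conn.executemany("INSERT INTO articles VALUES (?, ?)", [(1, None), (2, 1), (3, None)])

calls = []  # records every time the expensive pass runs


def fake_recluster(c):
    calls.append("recluster")
    return {"duplicates": 1}


idle = dedup_cycle(conn, lambda c: 0, fake_recluster)  # nothing new: skipped
busy = dedup_cycle(conn, lambda c: 2, fake_recluster)  # two new articles: re-cluster
```

The idle cycle issues a single SELECT and never calls the clustering function,
which is the property that keeps the write lock free for API reads.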
+13 -4
@@ -513,6 +513,10 @@ def _build_grid(words: list[str], size: int, seed: int) -> tuple[list[str], list
     for word in sorted(words, key=len, reverse=True):
         if len(word) > size:
             continue
+        # Gather valid placements, then PREFER one that crosses an already-placed
+        # word (shares a matching letter) so the grid interlocks like a real word
+        # search — falling back to any valid spot so every word still gets placed.
+        valid = []  # (overlap_count, cells)
         for _ in range(400):
             dr, dc = rng.choice(_DIRS)
             r0, c0 = rng.randrange(size), rng.randrange(size)
@@ -520,10 +524,15 @@ def _build_grid(words: list[str], size: int, seed: int) -> tuple[list[str], list
             if any(not (0 <= r < size and 0 <= c < size) for r, c in cells):
                 continue
             if all(grid[r][c] in (None, word[i]) for i, (r, c) in enumerate(cells)):
-                for i, (r, c) in enumerate(cells):
-                    grid[r][c] = word[i]
-                placed.append(word)
-                break
+                overlap = sum(1 for i, (r, c) in enumerate(cells) if grid[r][c] == word[i])
+                valid.append((overlap, cells))
+        if not valid:
+            continue
+        crossing = [c for c in valid if c[0] > 0]
+        _, cells = rng.choice(crossing if crossing else valid)
+        for i, (r, c) in enumerate(cells):
+            grid[r][c] = word[i]
+        placed.append(word)
     for r in range(size):
         for c in range(size):
             if grid[r][c] is None: