Perf: skip needless dedup re-cluster + interlock word-search grids
Two things found while chasing the recurring ~15min slowness:

- dedup.py: cluster_duplicates re-ran an O(n²) cosine pass over ALL ~3.7k articles and rewrote duplicate_of for every one of them EVERY cycle, even when nothing new arrived (embedded=0). That cost ~53s CPU plus a large WAL commit that starved live API reads (/api/brief 2-7s). Now skip the re-cluster entirely when nothing new was embedded (clusters can't have changed). Verified: cycle drops from ~53s to ~1s and /api/brief stays at 20ms through a cycle, vs 2-7s before. (A real new article still triggers a full re-cluster.)
- games.py _build_grid: word placement took the first random valid spot, so words rarely crossed. Now gather valid placements and PREFER ones that cross an already-placed word (shared matching letter), falling back to any valid spot, so the grid interlocks like a real word search. Every word still placed (tests green). NOTE: changes today's grid layouts, so an in-progress word search resets once.

Also added a systemd drop-in (Nice=19/CPUWeight=20/IOWeight=10/ionice-idle) to deprioritize the batch cycle. This is minor; the dedup skip is the real fix.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
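For a sense of scale, the O(n²) cosine pass mentioned in the message can be put into numbers. This is a minimal sketch that assumes every unordered pair of articles is scored once and uses the ~3.7k figure quoted above; `pair_count` is a hypothetical helper, not a function in dedup.py.

```python
def pair_count(n: int) -> int:
    """Number of unordered pairs an all-pairs similarity pass has to score."""
    return n * (n - 1) // 2


# Roughly 3.7k articles, as quoted in the commit message (an approximation).
comparisons = pair_count(3700)
print(comparisons)  # 6843150
```

Close to seven million similarity computations per cycle, followed by a write to every row, is consistent with the reported cost of a cycle that had nothing new to process.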
@@ -166,6 +166,16 @@ def dedup(
     embed_limit: int | None = None,
 ) -> dict:
     embedded = ensure_embeddings(conn, client, limit=embed_limit)
+    if embedded == 0:
+        # Nothing new entered the corpus → the clusters and duplicate_of links are
+        # unchanged, so skip the full re-cluster. It was re-running an O(n²) cosine
+        # pass over EVERY article and rewriting duplicate_of for all ~3.7k of them
+        # every cycle (~53s + a large WAL commit), which starved live API reads
+        # (/api/brief 2-7s). Most cycles find no new articles, so this makes the
+        # cycle near-instant and keeps reads fast. A real new article re-runs it.
+        dups = conn.execute("SELECT COUNT(*) FROM articles WHERE duplicate_of IS NOT NULL").fetchone()[0]
+        return {"embedded": 0, "articles": 0, "clusters": 0, "duplicate_clusters": 0,
+                "duplicates": dups, "skipped": True}
     stats = cluster_duplicates(conn, threshold=threshold, window_days=window_days)
     stats["embedded"] = embedded
     return stats
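The guard in this hunk can be exercised in isolation. The following is a minimal sketch, not the project's code: `dedup_sketch`, `fake_recluster`, the two-column `articles` schema and the lambda embedders are stand-ins invented for illustration, and only the `duplicate_of` column and the shape of the early return come from the diff.

```python
import sqlite3


def dedup_sketch(conn, embed_new, recluster):
    """Run the expensive re-cluster only when new embeddings were written."""
    embedded = embed_new(conn)
    if embedded == 0:
        # Clusters cannot have changed, so report the stored duplicate count
        # instead of recomputing and rewriting every row.
        dups = conn.execute(
            "SELECT COUNT(*) FROM articles WHERE duplicate_of IS NOT NULL"
        ).fetchone()[0]
        return {"embedded": 0, "duplicates": dups, "skipped": True}
    stats = recluster(conn)
    stats["embedded"] = embedded
    return stats


# Hypothetical minimal schema: the real articles table is not shown in the diff.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (id INTEGER PRIMARY KEY, duplicate_of INTEGER)")
conn.executemany("INSERT INTO articles VALUES (?, ?)", [(1, None), (2, 1), (3, 1)])

calls = []


def fake_recluster(c):
    """Stand-in for cluster_duplicates that records each invocation."""
    calls.append(1)
    return {"duplicates": 2}


idle = dedup_sketch(conn, lambda c: 0, fake_recluster)  # nothing new embedded
busy = dedup_sketch(conn, lambda c: 1, fake_recluster)  # one new article
```

In the idle case the stand-in clusterer is never called and the only database work is a single read-only COUNT, which is why the skipped cycle no longer produces a large WAL commit.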
+13
-4
@@ -513,6 +513,10 @@ def _build_grid(words: list[str], size: int, seed: int) -> tuple[list[str], list
     for word in sorted(words, key=len, reverse=True):
         if len(word) > size:
             continue
+        # Gather valid placements, then PREFER one that crosses an already-placed
+        # word (shares a matching letter) so the grid interlocks like a real word
+        # search — falling back to any valid spot so every word still gets placed.
+        valid = []  # (overlap_count, cells)
         for _ in range(400):
             dr, dc = rng.choice(_DIRS)
             r0, c0 = rng.randrange(size), rng.randrange(size)
@@ -520,10 +524,15 @@ def _build_grid(words: list[str], size: int, seed: int) -> tuple[list[str], list
             if any(not (0 <= r < size and 0 <= c < size) for r, c in cells):
                 continue
             if all(grid[r][c] in (None, word[i]) for i, (r, c) in enumerate(cells)):
-                for i, (r, c) in enumerate(cells):
-                    grid[r][c] = word[i]
-                placed.append(word)
-                break
+                overlap = sum(1 for i, (r, c) in enumerate(cells) if grid[r][c] == word[i])
+                valid.append((overlap, cells))
+        if not valid:
+            continue
+        crossing = [c for c in valid if c[0] > 0]
+        _, cells = rng.choice(crossing if crossing else valid)
+        for i, (r, c) in enumerate(cells):
+            grid[r][c] = word[i]
+        placed.append(word)
     for r in range(size):
         for c in range(size):
             if grid[r][c] is None:
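The placement strategy in these two hunks can be tried as a standalone sketch. Several parts are assumptions because the diff does not show them: the contents of `_DIRS` (taken here to be the eight compass directions), the line that computes `cells` between the hunks, and the filler step for empty cells, which is left out. `build_grid_sketch` is a hypothetical name, and it records each word's cells so the result can be checked.

```python
import random

# Assumed direction set; the real _DIRS is not visible in the diff.
DIRS = [(0, 1), (1, 0), (1, 1), (-1, 1), (0, -1), (-1, 0), (-1, -1), (1, -1)]


def build_grid_sketch(words, size, seed, attempts=400):
    """Place words longest first, preferring spots that cross an earlier word."""
    rng = random.Random(seed)
    grid = [[None] * size for _ in range(size)]
    placed = []
    for word in sorted(words, key=len, reverse=True):
        if len(word) > size:
            continue
        valid = []  # (overlap_count, cells)
        for _ in range(attempts):
            dr, dc = rng.choice(DIRS)
            r0, c0 = rng.randrange(size), rng.randrange(size)
            # Assumed reconstruction of the line hidden between the two hunks.
            cells = [(r0 + dr * i, c0 + dc * i) for i in range(len(word))]
            if any(not (0 <= r < size and 0 <= c < size) for r, c in cells):
                continue
            if all(grid[r][c] in (None, word[i]) for i, (r, c) in enumerate(cells)):
                overlap = sum(1 for i, (r, c) in enumerate(cells) if grid[r][c] == word[i])
                valid.append((overlap, cells))
        if not valid:
            continue
        # Prefer placements sharing at least one letter with the existing grid.
        crossing = [v for v in valid if v[0] > 0]
        _, cells = rng.choice(crossing if crossing else valid)
        for i, (r, c) in enumerate(cells):
            grid[r][c] = word[i]
        placed.append((word, cells))
    return grid, placed


grid, placed = build_grid_sketch(["PYTHON", "SQLITE", "CACHE", "GRID"], size=10, seed=7)
```

Because a candidate is accepted only when every cell is empty or already holds the matching letter, later placements never corrupt earlier words, so each word can still be read back from its recorded cells.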