Perf: skip needless dedup re-cluster + interlock word-search grids

Two things found while chasing the recurring ~15min slowness:

- dedup.py: cluster_duplicates re-ran an O(n²) cosine pass over ALL ~3.7k
  articles and rewrote duplicate_of for every one of them EVERY cycle, even
  when nothing new arrived (embedded=0): ~53s CPU + a large WAL commit that
  starved live API reads (/api/brief 2-7s). Now skip the re-cluster entirely
  when nothing new was embedded (clusters can't have changed). Verified:
  cycle drops from ~53s to ~1s and /api/brief stays at 20ms through a cycle,
  vs 2-7s before. (A real new article still triggers a full re-cluster.)
- games.py _build_grid: word placement took the first random valid spot, so
  words rarely crossed. Now gather valid placements and PREFER ones that
  cross an already-placed word (shared matching letter), falling back to any
  valid spot, so the grid interlocks like a real word search. Every word
  still placed (tests green). NOTE: changes today's grid layouts, so an
  in-progress word search resets once.

Also added a systemd drop-in (Nice=19/CPUWeight=20/IOWeight=10/ionice-idle)
to deprioritize the batch cycle; minor, the dedup skip is the real fix.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
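For readers who want to try the games.py placement rule in isolation, here is a minimal standalone sketch of the "prefer crossing placements, fall back to any valid one" idea described above. It is not the code from this commit: `place_words`, `DIRS`, and the `tries` parameter are hypothetical names, the direction set is an assumption (the real `_DIRS` is not shown in the diff), and the way `cells` is computed is assumed because that line falls outside the hunks.

```python
import random

# Assumed direction set; the real _DIRS in games.py is not visible in the diff.
DIRS = [(0, 1), (1, 0), (1, 1), (-1, 1)]


def place_words(words, size, seed, tries=400):
    """Place words on a size x size grid, preferring spots that cross earlier words."""
    rng = random.Random(seed)
    grid = [[None] * size for _ in range(size)]
    placed = []
    for word in sorted(words, key=len, reverse=True):
        if len(word) > size:
            continue
        candidates = []  # (overlap_count, cells) for every valid random try
        for _ in range(tries):
            dr, dc = rng.choice(DIRS)
            r0, c0 = rng.randrange(size), rng.randrange(size)
            # Assumed cell computation: step from the start along the direction.
            cells = [(r0 + dr * i, c0 + dc * i) for i in range(len(word))]
            if any(not (0 <= r < size and 0 <= c < size) for r, c in cells):
                continue
            # Valid if every cell is empty or already holds the matching letter.
            if all(grid[r][c] in (None, word[i]) for i, (r, c) in enumerate(cells)):
                overlap = sum(1 for i, (r, c) in enumerate(cells) if grid[r][c] == word[i])
                candidates.append((overlap, cells))
        if not candidates:
            continue
        # Prefer placements sharing at least one letter with an already-placed word.
        crossing = [cand for cand in candidates if cand[0] > 0]
        _, cells = rng.choice(crossing or candidates)
        for i, (r, c) in enumerate(cells):
            grid[r][c] = word[i]
        placed.append(word)
    return grid, placed
```

Calling `place_words(["PYTHON", "CODE", "GRID", "TEST"], 10, seed=1)` returns the partially filled grid and the list of words that were placed. The same seed always yields the same layout, which is why changing the selection rule changes existing layouts.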
2026-06-12 12:35:01 -04:00
parent 39d682f353
commit 2ef0efd909
2 changed files with 23 additions and 4 deletions
@@ -166,6 +166,16 @@ def dedup(
    embed_limit: int | None = None,
 ) -> dict:
    embedded = ensure_embeddings(conn, client, limit=embed_limit)
+    if embedded == 0:
+        # Nothing new entered the corpus → the clusters and duplicate_of links are
+        # unchanged, so skip the full re-cluster. It was re-running an O(n²) cosine
+        # pass over EVERY article and rewriting duplicate_of for all ~3.7k of them
+        # every cycle (~53s + a large WAL commit), which starved live API reads
+        # (/api/brief 2-7s). Most cycles find no new articles, so this makes the
+        # cycle near-instant and keeps reads fast. A real new article re-runs it.
+        dups = conn.execute("SELECT COUNT(*) FROM articles WHERE duplicate_of IS NOT NULL").fetchone()[0]
+        return {"embedded": 0, "articles": 0, "clusters": 0, "duplicate_clusters": 0,
+                "duplicates": dups, "skipped": True}
    stats = cluster_duplicates(conn, threshold=threshold, window_days=window_days)
    stats["embedded"] = embedded
    return stats
@@ -513,6 +513,10 @@ def _build_grid(words: list[str], size: int, seed: int) -> tuple[list[str], list
    for word in sorted(words, key=len, reverse=True):
        if len(word) > size:
            continue
+        # Gather valid placements, then PREFER one that crosses an already-placed
+        # word (shares a matching letter) so the grid interlocks like a real word
+        # search — falling back to any valid spot so every word still gets placed.
+        valid = []  # (overlap_count, cells)
        for _ in range(400):
            dr, dc = rng.choice(_DIRS)
            r0, c0 = rng.randrange(size), rng.randrange(size)
@@ -520,10 +524,15 @@ def _build_grid(words: list[str], size: int, seed: int) -> tuple[list[str], list
            if any(not (0 <= r < size and 0 <= c < size) for r, c in cells):
                continue
            if all(grid[r][c] in (None, word[i]) for i, (r, c) in enumerate(cells)):
-                for i, (r, c) in enumerate(cells):
-                    grid[r][c] = word[i]
-                placed.append(word)
-                break
+                overlap = sum(1 for i, (r, c) in enumerate(cells) if grid[r][c] == word[i])
+                valid.append((overlap, cells))
+        if not valid:
+            continue
+        crossing = [c for c in valid if c[0] > 0]
+        _, cells = rng.choice(crossing if crossing else valid)
+        for i, (r, c) in enumerate(cells):
+            grid[r][c] = word[i]
+        placed.append(word)
    for r in range(size):
        for c in range(size):
            if grid[r][c] is None: