Add semantic cross-source dedup via local embeddings

- LocalModelClient.embed() calls the OpenAI-compatible /embeddings endpoint
  (local nomic model); base_url shared with chat, model via GOODNEWS_EMBED_MODEL.
- New article_embeddings table and articles.duplicate_of column (+ migration).
- dedup module: embeds missing articles, clusters near-identical stories within
  a date window by cosine similarity (pure-stdlib, vectors normalised once), and
  marks all but the highest-ranked member of each cluster as a duplicate.
- 'dedup' CLI command; cycle now runs poll -> classify -> dedup -> brief.
- Feed and brief queries hide duplicates, so a story carried by multiple outlets
  shows once.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-30 15:40:55 +00:00
parent 2a9c49e2a9
commit 5d44072fca
7 changed files with 259 additions and 4 deletions
@@ -15,6 +15,8 @@ python3 -m goodnews poll --limit 3
 python3 -m goodnews rescore
 python3 -m goodnews check-llm --base-url http://127.0.0.1:1234/v1 --model gpt-oss
 python3 -m goodnews classify --limit 10 --base-url http://127.0.0.1:1234/v1 --model gpt-oss
+python3 -m goodnews dedup --base-url http://127.0.0.1:1234/v1
+python3 -m goodnews check-feeds
 python3 -m goodnews build-brief --date 2026-05-27 --replace
 python3 -m goodnews show-brief
 python3 -m goodnews list-recent --limit 10
@@ -49,6 +51,18 @@ and one **flavor**, allowing browsable category feeds (e.g. "feel-good animals",
 The allowed values live in `goodnews/taxonomy.py`. The accept/reject gate is kept
 deliberately broad ("not dreary"); ranking and category filters do the curation.

+## Deduplication
+
+Two layers:
+
+- **Exact**: a URL hash UNIQUE constraint drops the literal same link at ingest.
+- **Semantic**: `dedup` embeds each article's title+snippet with the local
+  embedding model, clusters near-identical stories within a few-day window
+  (cosine similarity), and marks all but the highest-ranked in each cluster as
+  `duplicate_of` the representative. Feed and brief queries hide duplicates, so
+  the same story carried by several outlets appears once. This runs as part of
+  `cycle`, so the scheduler keeps the corpus deduped automatically.
+
 ## Stored Article Data

 For each article, the database stores:
@@ -112,7 +126,7 @@ often as you like — it only polls sources that are *due* (per each source's
 rebuilds the current day's brief:

 ```bash
-python3 -m goodnews cycle                 # poll due -> classify new -> rebuild today's brief
+python3 -m goodnews cycle                 # poll due -> classify new -> dedup -> rebuild today's brief
 python3 -m goodnews cycle --force         # poll every active source regardless of interval
 python3 -m goodnews cycle --no-classify   # skip the LLM step (e.g. model box offline)
 ```
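
The dedup module itself is not part of the hunks shown here, so the following is only a minimal sketch of the clustering step the commit message describes: pure-stdlib cosine similarity, vectors normalised once, and every cluster member except the highest-ranked one marked as a duplicate. The `Article` dataclass, the `mark_duplicates` name, the 0.9 threshold, and the 3-day window are all assumptions made for illustration, and the greedy "compare against existing representatives" strategy is one plausible way to cluster, not necessarily the one the commit uses.

```python
import math
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class Article:
    # Hypothetical in-memory stand-in for a row joined with its embedding.
    id: int
    published: date
    rank: float
    vector: List[float]
    duplicate_of: Optional[int] = None


def normalise(vector: List[float]) -> List[float]:
    """Scale a vector to unit length so cosine similarity is a plain dot product."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm else list(vector)


def mark_duplicates(
    articles: List[Article],
    threshold: float = 0.9,   # assumed similarity cut-off
    window_days: int = 3,     # assumed "few-day" window
) -> List[Article]:
    """Greedily assign each article to the first similar, higher-ranked representative."""
    # Normalise every vector exactly once, up front.
    units = {a.id: normalise(a.vector) for a in articles}
    # Visit the best-ranked articles first so each cluster's representative
    # is its highest-ranked member.
    ordered = sorted(articles, key=lambda a: a.rank, reverse=True)
    representatives: List[Article] = []
    for article in ordered:
        article.duplicate_of = None
        for rep in representatives:
            if abs((article.published - rep.published).days) > window_days:
                continue  # outside the date window, never compared
            similarity = sum(x * y for x, y in zip(units[article.id], units[rep.id]))
            if similarity >= threshold:
                article.duplicate_of = rep.id
                break
        else:
            # No sufficiently similar representative found: start a new cluster.
            representatives.append(article)
    return articles
```

Under these assumptions, a feed or brief query would then only need to filter on `duplicate_of IS NULL` to show each story once. Sorting by rank before clustering keeps the "highest-ranked member wins" rule implicit in the iteration order rather than requiring a second pass over each cluster.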