Add semantic cross-source dedup via local embeddings

- LocalModelClient.embed() calls the OpenAI-compatible /embeddings endpoint
  (local nomic model); base_url shared with chat, model via GOODNEWS_EMBED_MODEL.
- New article_embeddings table and articles.duplicate_of column (+ migration).
- dedup module: embeds missing articles, clusters near-identical stories within
  a date window by cosine similarity (pure-stdlib, vectors normalised once), and
  marks all but the highest-ranked member of each cluster as a duplicate.
- 'dedup' CLI command; cycle now runs poll -> classify -> dedup -> brief.
- Feed and brief queries hide duplicates, so a story carried by multiple
  outlets shows once.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
jay
2026-05-30 15:40:55 +00:00
parent 2a9c49e2a9
commit 5d44072fca
7 changed files with 259 additions and 4 deletions
+15 -1
@@ -15,6 +15,8 @@ python3 -m goodnews poll --limit 3
python3 -m goodnews rescore
python3 -m goodnews check-llm --base-url http://127.0.0.1:1234/v1 --model gpt-oss
python3 -m goodnews classify --limit 10 --base-url http://127.0.0.1:1234/v1 --model gpt-oss
python3 -m goodnews dedup --base-url http://127.0.0.1:1234/v1
python3 -m goodnews check-feeds
python3 -m goodnews build-brief --date 2026-05-27 --replace
python3 -m goodnews show-brief
python3 -m goodnews list-recent --limit 10
@@ -49,6 +51,18 @@ and one **flavor**, allowing browsable category feeds (e.g. "feel-good animals",
The allowed values live in `goodnews/taxonomy.py`. The accept/reject gate is kept
deliberately broad ("not dreary"); ranking and category filters do the curation.
## Deduplication

Two layers:

- **Exact**: a URL hash UNIQUE constraint drops the literal same link at ingest.
- **Semantic**: `dedup` embeds each article's title+snippet with the local
  embedding model, clusters near-identical stories within a few-day window
  (cosine similarity), and marks every article except the highest-ranked one in
  each cluster as `duplicate_of` the representative. Feed and brief queries hide
  duplicates, so the same story carried by several outlets appears once. This
  runs as part of `cycle`, so the scheduler keeps the corpus deduped
  automatically.
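
As a rough illustration of the semantic layer, the sketch below normalises each vector once and then groups articles greedily by cosine similarity, keeping the highest-ranked article as each cluster's representative. It is not the project's dedup module: the function names, the `rank` mapping, and the `0.9` threshold are assumptions made for this example, and the date-window filter is left out (the caller is assumed to pass only articles that fall inside one window).

```python
import math
from typing import Dict, List, Sequence


def normalise(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length so a plain dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)  # leave an all-zero vector untouched
    return [x / norm for x in vec]


def cluster_duplicates(
    embeddings: Dict[int, Sequence[float]],
    rank: Dict[int, float],
    threshold: float = 0.9,  # hypothetical cut-off, not taken from the project
) -> Dict[int, int]:
    """Return {duplicate_article_id: representative_article_id}.

    Greedy single pass: articles are visited from highest to lowest rank, and
    each one either joins the first representative it is similar enough to or
    becomes a new representative itself.
    """
    ordered = sorted(embeddings, key=lambda art_id: rank[art_id], reverse=True)
    unit = {art_id: normalise(embeddings[art_id]) for art_id in ordered}

    representatives: List[int] = []
    duplicate_of: Dict[int, int] = {}
    for art_id in ordered:
        for rep_id in representatives:
            similarity = sum(a * b for a, b in zip(unit[art_id], unit[rep_id]))
            if similarity >= threshold:
                duplicate_of[art_id] = rep_id
                break
        else:
            # No existing representative is close enough: start a new cluster.
            representatives.append(art_id)
    return duplicate_of
```

Because articles are visited in rank order, a representative is always the highest-ranked member of its cluster, which matches the "mark all but the highest-ranked" rule described above. A query that hides duplicates would then only need to filter on `duplicate_of IS NULL`.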
## Stored Article Data
For each article, the database stores:
@@ -112,7 +126,7 @@ often as you like — it only polls sources that are *due* (per each source's
rebuilds the current day's brief:
```bash
python3 -m goodnews cycle # poll due -> classify new -> rebuild today's brief
python3 -m goodnews cycle # poll due -> classify new -> dedup -> rebuild today's brief
python3 -m goodnews cycle --force # poll every active source regardless of interval
python3 -m goodnews cycle --no-classify # skip the LLM step (e.g. model box offline)
```