Add semantic cross-source dedup via local embeddings

- LocalModelClient.embed() calls the OpenAI-compatible /embeddings endpoint
  (local nomic model); base_url shared with chat, model via GOODNEWS_EMBED_MODEL.
- New article_embeddings table and articles.duplicate_of column (+ migration).
- dedup module: embeds missing articles, clusters near-identical stories within
  a date window by cosine similarity (pure-stdlib, vectors normalised once), and
  marks all but the highest-ranked member of each cluster as a duplicate.
- 'dedup' CLI command; cycle now runs poll -> classify -> dedup -> brief.
- Feed and brief queries hide duplicates, so a story carried by multiple
  outlets shows once.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
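
For readers unfamiliar with the endpoint named in the commit message, here is a minimal sketch of a call to an OpenAI-compatible `/embeddings` route using only the standard library. It is not the actual `LocalModelClient.embed()` code: the function names, the fallback model name, and the timeout are assumptions made for illustration. Only the endpoint path, the shared base URL, and the `GOODNEWS_EMBED_MODEL` variable come from the commit.

```python
import json
import os
import urllib.request

# Assumption: illustrative fallback only; the commit just says "local nomic model".
DEFAULT_EMBED_MODEL = "nomic-embed-text"


def build_embed_request(base_url, texts, model=None):
    """Build a POST request for an OpenAI-compatible /embeddings endpoint."""
    model = model or os.environ.get("GOODNEWS_EMBED_MODEL", DEFAULT_EMBED_MODEL)
    payload = json.dumps({"model": model, "input": list(texts)}).encode("utf-8")
    return urllib.request.Request(
        base_url.rstrip("/") + "/embeddings",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_embed_response(body):
    """Return the embedding vectors in input order from an OpenAI-style body."""
    items = sorted(json.loads(body)["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in items]


def embed(base_url, texts, model=None, timeout=30):
    """Send the texts to the server and return one vector per text."""
    request = build_embed_request(base_url, texts, model)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return parse_embed_response(response.read())
```

Splitting request building from response parsing keeps both halves testable without a running model server; `embed("http://127.0.0.1:1234/v1", ["some title"])` would perform the real call.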
@@ -15,6 +15,8 @@ python3 -m goodnews poll --limit 3
python3 -m goodnews rescore
python3 -m goodnews check-llm --base-url http://127.0.0.1:1234/v1 --model gpt-oss
python3 -m goodnews classify --limit 10 --base-url http://127.0.0.1:1234/v1 --model gpt-oss
python3 -m goodnews dedup --base-url http://127.0.0.1:1234/v1
python3 -m goodnews check-feeds
python3 -m goodnews build-brief --date 2026-05-27 --replace
python3 -m goodnews show-brief
python3 -m goodnews list-recent --limit 10
@@ -49,6 +51,18 @@ and one **flavor**, allowing browsable category feeds (e.g. "feel-good animals",
The allowed values live in `goodnews/taxonomy.py`. The accept/reject gate is kept
deliberately broad ("not dreary"); ranking and category filters do the curation.

## Deduplication

Two layers:

- **Exact**: a URL hash UNIQUE constraint drops the literal same link at ingest.
- **Semantic**: `dedup` embeds each article's title+snippet with the local
embedding model, clusters near-identical stories within a few-day window
(cosine similarity), and marks all but the highest-ranked in each cluster as
`duplicate_of` the representative. Feed and brief queries hide duplicates, so
the same story carried by several outlets appears once. This runs as part of
`cycle`, so the scheduler keeps the corpus deduped automatically.
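
The semantic layer can be illustrated with a small pure-stdlib sketch. This is not the code in the `dedup` module: the greedy compare-against-representatives strategy, the `threshold` and `window_days` defaults, and the dict field names are all assumptions chosen for the example. What it does take from the description is normalising vectors once (so cosine similarity becomes a dot product), restricting comparisons to a date window, and keeping the highest-ranked member of each cluster.

```python
import math
from datetime import date


def normalise(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def mark_duplicates(articles, threshold=0.9, window_days=3):
    """Return {duplicate_id: representative_id}.

    Each article is a dict with hypothetical keys: id, day (datetime.date),
    rank (higher is better), and vector (list of floats).
    """
    # Normalise once so cosine similarity is a plain dot product later.
    unit = {a["id"]: normalise(a["vector"]) for a in articles}
    # Visit the best-ranked articles first: the first member seen in a
    # cluster becomes its representative.
    ordered = sorted(articles, key=lambda a: a["rank"], reverse=True)
    representatives = []
    duplicate_of = {}
    for art in ordered:
        for rep in representatives:
            if abs((art["day"] - rep["day"]).days) > window_days:
                continue  # outside the date window, never compared
            sim = sum(x * y for x, y in zip(unit[art["id"]], unit[rep["id"]]))
            if sim >= threshold:
                duplicate_of[art["id"]] = rep["id"]
                break
        else:
            representatives.append(art)
    return duplicate_of
```

A feed query would then skip every id that appears as a key in the returned mapping, which corresponds to rows whose `duplicate_of` column is set.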

## Stored Article Data

For each article, the database stores:
@@ -112,7 +126,7 @@ often as you like — it only polls sources that are *due* (per each source's
rebuilds the current day's brief:

```bash
-python3 -m goodnews cycle # poll due -> classify new -> rebuild today's brief
+python3 -m goodnews cycle # poll due -> classify new -> dedup -> rebuild today's brief
python3 -m goodnews cycle --force # poll every active source regardless of interval
python3 -m goodnews cycle --no-classify # skip the LLM step (e.g. model box offline)
```