goodNews
Local-first constructive news ingestion prototype.
The first milestone is intentionally small: collect public RSS/Atom metadata, dedupe it, store short source-provided snippets, and attach early reason-coded heuristic scores. It does not store full article bodies.
Commands
From this directory:
python3 -m goodnews init-db
python3 -m goodnews import-sources
python3 -m goodnews poll --limit 3
python3 -m goodnews rescore
python3 -m goodnews check-llm --base-url http://127.0.0.1:1234/v1 --model gpt-oss
python3 -m goodnews classify --limit 10 --base-url http://127.0.0.1:1234/v1 --model gpt-oss
python3 -m goodnews dedup --base-url http://127.0.0.1:1234/v1
python3 -m goodnews check-feeds
python3 -m goodnews build-brief --date 2026-05-27 --replace
python3 -m goodnews show-brief
python3 -m goodnews list-recent --limit 10
python3 -m goodnews list-recent --accepted-only --limit 10
python3 -m goodnews list-category --topic animals --flavor discovery
python3 -m goodnews list-category --topic environment --flavor solution
python3 -m goodnews source-report
python3 -m goodnews list-runs
The SQLite database lives at:
data/goodnews.sqlite3
Sources live at:
config/sources.toml
Categories
When classified by the local model, each article is tagged with one topic
and one flavor, allowing browsable category feeds (e.g. "feel-good animals",
"environment solutions") via list-category:
- Topics: science, environment, health, community, culture, animals
- Flavors: breakthrough, discovery, solution, feelgood, perspective
The allowed values live in goodnews/taxonomy.py. The accept/reject gate is kept
deliberately broad ("not dreary"); ranking and category filters do the curation.
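As a rough illustration of how a model-supplied tag pair could be checked against the taxonomy, here is a minimal sketch. The constants mirror the lists above, but the names TOPICS, FLAVORS and validate_tags are assumptions for this example and may not match what goodnews/taxonomy.py actually exports.

```python
# Hypothetical mirror of the allowed values; the real ones live in goodnews/taxonomy.py.
TOPICS = frozenset({"science", "environment", "health", "community", "culture", "animals"})
FLAVORS = frozenset({"breakthrough", "discovery", "solution", "feelgood", "perspective"})


def validate_tags(topic: str, flavor: str) -> tuple[str, str]:
    """Normalise a (topic, flavor) pair, or raise ValueError if either is unknown."""
    topic = topic.strip().lower()
    flavor = flavor.strip().lower()
    if topic not in TOPICS:
        raise ValueError(f"unknown topic: {topic!r}")
    if flavor not in FLAVORS:
        raise ValueError(f"unknown flavor: {flavor!r}")
    return topic, flavor


print(validate_tags(" Animals", "feelgood"))
```

Rejecting unknown values at this point would keep list-category filters from silently matching nothing because of a misspelt tag.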
Deduplication
Two layers:
- Exact: a URL hash UNIQUE constraint drops the literal same link at ingest.
- Semantic: dedup embeds each article's title+snippet with the local embedding model, clusters near-identical stories within a few-day window (cosine similarity), and marks all but the highest-ranked in each cluster as duplicate_of the representative. Feed and brief queries hide duplicates, so the same story carried by several outlets appears once. This runs as part of cycle, so the scheduler keeps the corpus deduped automatically.
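The semantic layer can be sketched in pure stdlib roughly as follows. This is not the project's dedup module: the Item class, the 0.9 similarity threshold, the 3-day window and the union-find clustering are all assumptions chosen for illustration. Vectors are normalised once so that cosine similarity reduces to a dot product.

```python
import math
from dataclasses import dataclass
from datetime import date


@dataclass
class Item:  # hypothetical stand-in for an article row plus its embedding
    id: int
    day: date
    score: float
    vector: list[float]


def normalise(v: list[float]) -> list[float]:
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)


def find_duplicates(items, threshold=0.9, window_days=3):
    """Return {duplicate_id: representative_id} for near-identical items."""
    unit = {it.id: normalise(it.vector) for it in items}  # normalise once
    parent = {it.id: it.id for it in items}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, a in enumerate(items):
        for b in items[i + 1:]:
            if abs((a.day - b.day).days) > window_days:
                continue  # outside the date window
            similarity = sum(p * q for p, q in zip(unit[a.id], unit[b.id]))
            if similarity >= threshold:
                parent[find(a.id)] = find(b.id)  # merge the two clusters

    clusters = {}
    for it in items:
        clusters.setdefault(find(it.id), []).append(it)

    duplicate_of = {}
    for members in clusters.values():
        representative = max(members, key=lambda it: it.score)
        for it in members:
            if it.id != representative.id:
                duplicate_of[it.id] = representative.id
    return duplicate_of
```

The pairwise loop is quadratic in the number of articles, which is usually acceptable when the date window keeps each batch small.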
Stored Article Data
For each article, the database stores:
- source
- canonical URL
- title
- short RSS/Atom description or summary
- author, if present
- published timestamp, if present
- image URL, if present
- language, if present
- hashes used for dedupe
- heuristic scores and reason codes
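To show how a dedupe hash combined with a UNIQUE constraint drops a repeated link at ingest, here is a minimal sketch. The table layout, the url_hash column and the choice of SHA-256 are assumptions for this example, not the project's actual schema.

```python
import hashlib
import sqlite3

# Hypothetical, simplified schema; the real articles table has more columns.
SCHEMA = """
CREATE TABLE articles (
    id       INTEGER PRIMARY KEY,
    url_hash TEXT NOT NULL UNIQUE,
    url      TEXT NOT NULL,
    title    TEXT NOT NULL
)
"""


def url_hash(url: str) -> str:
    """Stable hex digest of the canonical URL."""
    return hashlib.sha256(url.strip().encode("utf-8")).hexdigest()


def insert_article(conn: sqlite3.Connection, url: str, title: str) -> bool:
    """Insert an article; return False if the same URL was already stored."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO articles (url_hash, url, title) VALUES (?, ?, ?)",
        (url_hash(url), url, title),
    )
    return cur.rowcount == 1


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
print(insert_article(conn, "https://example.org/story", "A story"))        # first insert
print(insert_article(conn, "https://example.org/story", "A story, again"))  # ignored
```

Letting the database enforce uniqueness means concurrent or repeated polls cannot create a second row for the same link.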
Web / API
The optional web extra adds a FastAPI service and a small static site that
consumes it. The same JSON API backs both the website and any future companion
app; its auto-generated OpenAPI docs at /docs are the shared contract.
pip install -e '.[web]' # or: .venv/bin/pip install -e '.[web]'
python3 -m goodnews serve # http://127.0.0.1:8000
python3 -m goodnews serve --host 0.0.0.0 # expose on the network
Endpoints:
- GET /: the static site (daily five + topic/flavor browsing)
- GET /healthz: liveness + scored-article count
- GET /api/categories: the topic/flavor taxonomy
- GET /api/category-counts: article counts per topic/flavor
- GET /api/feed?topic=&flavor=&limit=&offset=: ranked, filtered articles
- GET /api/brief?date=&limit=: a daily brief (latest if no date)
- GET /api/brief-dates: available brief dates
- GET /docs: interactive OpenAPI documentation
The ingestion CLI stays pure-stdlib; only the web extra pulls in FastAPI/uvicorn,
so the two halves can be deployed and upgraded independently.
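A companion client can query the feed endpoint with the standard library alone. The sketch below assumes the default serve address and uses the query parameters listed above; the helper names feed_url and fetch_feed are made up for this example, and the shape of the JSON response is not specified here, so consult /docs before relying on particular fields.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def feed_url(base="http://127.0.0.1:8000", topic=None, flavor=None, limit=20, offset=0):
    """Build a /api/feed URL, omitting filters that were not supplied."""
    params = {"topic": topic, "flavor": flavor, "limit": limit, "offset": offset}
    query = urlencode({k: v for k, v in params.items() if v is not None})
    return f"{base}/api/feed?{query}"


def fetch_feed(**kwargs):
    """Fetch and decode the feed; requires a running `goodnews serve`."""
    with urlopen(feed_url(**kwargs), timeout=10) as resp:
        return json.load(resp)


print(feed_url(topic="animals", flavor="discovery", limit=5))
```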
Deployment
The database is never baked into the image — the API and the ingestion CLI share
one SQLite file via a mounted volume. Run ingestion (poll, classify,
build-brief) on a schedule against the same file.
docker build -t goodnews .
docker run -p 8000:8000 -v /srv/goodnews/data:/data goodnews
GOODNEWS_DB controls the database path (defaults to data/goodnews.sqlite3).
Put a reverse proxy (Caddy/nginx) in front for TLS once a domain is attached.
Scheduling
A single idempotent command runs the whole pipeline and is safe to invoke as
often as you like — it only polls sources that are due (per each source's
poll_interval_minutes), only classifies articles the model hasn't seen, and
rebuilds the current day's brief:
python3 -m goodnews cycle # poll due -> classify new -> dedup -> rebuild today's brief
python3 -m goodnews cycle --force # poll every active source regardless of interval
python3 -m goodnews cycle --no-classify # skip the LLM step (e.g. model box offline)
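The "only polls sources that are due" rule could look something like the sketch below. The function name and the last_polled_at argument are assumptions; only poll_interval_minutes and the --force behaviour come from the description above.

```python
from datetime import datetime, timedelta, timezone


def is_due(last_polled_at, poll_interval_minutes, now, force=False):
    """Decide whether a source should be polled in this cycle."""
    if force or last_polled_at is None:
        return True  # --force, or a source that has never been polled
    return now - last_polled_at >= timedelta(minutes=poll_interval_minutes)


now = datetime(2026, 5, 27, 12, 0, tzinfo=timezone.utc)
last = now - timedelta(minutes=20)
print(is_due(last, 15, now))  # interval elapsed
print(is_due(last, 60, now))  # still within the interval
```

Because the decision depends only on stored timestamps, running the command more often than any source's interval does no extra network work, which is what makes a fixed timer safe.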
A systemd timer runs it every 15 minutes. Unit files live in deploy/:
sudo install -d /etc/goodnews
sudo install -m 644 deploy/goodnews.env.example /etc/goodnews/goodnews.env # then edit
sudo install -m 644 deploy/goodnews.service deploy/goodnews.timer /etc/systemd/system/
sudo systemctl daemon-reload
sudo systemctl enable --now goodnews.timer
systemctl list-timers goodnews.timer # when it next runs
journalctl -u goodnews.service -f # watch cycle output
/etc/goodnews/goodnews.env supplies GOODNEWS_LLM_BASE_URL, GOODNEWS_LLM_MODEL,
and GOODNEWS_DB to the scheduled run. The timer uses Persistent=true, so a
run missed while the machine was off is caught up on the next boot.
Next Steps
- Run the poller for a few days and inspect which sources produce useful candidates.
- Add source-level quality notes and deactivate noisy feeds.
- Replace or supplement heuristic-v0 with a local model classifier.
- Add a daily brief builder that selects 5 items using scores and source diversity.
- Add a small web/API layer once the ingest data looks trustworthy.
Local Model Configuration
The classify command expects an OpenAI-compatible local chat-completions server.
You can pass settings directly:
python3 -m goodnews classify --base-url http://127.0.0.1:1234/v1 --model gpt-oss --limit 10
Or use environment variables:
export GOODNEWS_LLM_BASE_URL=http://127.0.0.1:1234/v1
export GOODNEWS_LLM_MODEL=gpt-oss
python3 -m goodnews classify --limit 10
classify rewrites the current score/reason row for selected candidates. rescore can restore the fast heuristic scores.
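One plausible way to combine the two configuration routes is sketched below. It assumes that command-line values take precedence over the environment variables, which is a common convention but is not confirmed here; resolve_llm_settings is a hypothetical helper, not part of the goodnews package.

```python
import os


def resolve_llm_settings(base_url=None, model=None, env=None):
    """Pick the LLM base URL and model from flags first, then the environment."""
    env = os.environ if env is None else env
    base_url = base_url or env.get("GOODNEWS_LLM_BASE_URL")
    model = model or env.get("GOODNEWS_LLM_MODEL")
    if not base_url or not model:
        raise ValueError("set --base-url/--model or GOODNEWS_LLM_BASE_URL/GOODNEWS_LLM_MODEL")
    return base_url.rstrip("/"), model


demo_env = {
    "GOODNEWS_LLM_BASE_URL": "http://127.0.0.1:1234/v1",
    "GOODNEWS_LLM_MODEL": "gpt-oss",
}
print(resolve_llm_settings(env=demo_env))
```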