# goodNews Local-first constructive news ingestion prototype. The first milestone is intentionally small: collect public RSS/Atom metadata, dedupe it, store short source-provided snippets, and attach early reason-coded heuristic scores. It does not store full article bodies. ## Commands From this directory: ```bash python3 -m goodnews init-db python3 -m goodnews import-sources python3 -m goodnews poll --limit 3 python3 -m goodnews rescore python3 -m goodnews check-llm --base-url http://127.0.0.1:1234/v1 --model gpt-oss python3 -m goodnews classify --limit 10 --base-url http://127.0.0.1:1234/v1 --model gpt-oss python3 -m goodnews dedup --base-url http://127.0.0.1:1234/v1 python3 -m goodnews check-feeds python3 -m goodnews preview-source https://example.com/feed/ --classify python3 -m goodnews suggest-source https://example.com/feed/ --name "Example" --classify python3 -m goodnews list-candidates python3 -m goodnews promote-candidate 1 # copies into sources (inactive by default) python3 -m goodnews reject-candidate 1 python3 -m goodnews review-sources # advisory health flags (never deactivates) python3 -m goodnews build-brief --date 2026-05-27 --replace python3 -m goodnews show-brief python3 -m goodnews list-recent --limit 10 python3 -m goodnews list-recent --accepted-only --limit 10 python3 -m goodnews list-category --topic animals --flavor discovery python3 -m goodnews list-category --topic environment --flavor solution python3 -m goodnews source-report python3 -m goodnews list-runs ``` The SQLite database lives at: ```txt data/goodnews.sqlite3 ``` Sources live at: ```txt config/sources.toml ``` ## Categories When classified by the local model, each article is tagged with one **topic** and one **flavor**, allowing browsable category feeds (e.g. "feel-good animals", "environment solutions") via `list-category`: - **Topics:** science, environment, health, community, culture, animals - **Flavors:** breakthrough, discovery, solution, feelgood, perspective The allowed values live in `goodnews/taxonomy.py`. The accept/reject gate is kept deliberately broad ("not dreary"); ranking and category filters do the curation. ## Deduplication Two layers: - **Exact**: a URL hash UNIQUE constraint drops the literal same link at ingest. - **Semantic**: `dedup` embeds each article's title+snippet with the local embedding model, clusters near-identical stories within a few-day window (cosine similarity), and marks all but the highest-ranked in each cluster as `duplicate_of` the representative. Feed and brief queries hide duplicates, so the same story carried by several outlets appears once. This runs as part of `cycle`, so the scheduler keeps the corpus deduped automatically. ## Stored Article Data For each article, the database stores: - source - canonical URL - title - short RSS/Atom description or summary - author, if present - published timestamp, if present - image URL, if present - language, if present - hashes used for dedupe - heuristic scores and reason codes ## Web / API The optional `web` extra adds a FastAPI service and a small static site that consumes it. The same JSON API backs both the website and any future companion app; its auto-generated OpenAPI docs at `/docs` are the shared contract. ```bash pip install -e '.[web]' # or: .venv/bin/pip install -e '.[web]' python3 -m goodnews serve # http://127.0.0.1:8000 python3 -m goodnews serve --host 0.0.0.0 # expose on the network ``` Endpoints: - `GET /` — the static site (daily five + topic/flavor browsing) - `GET /healthz` — liveness + scored-article count - `GET /api/categories` — the topic/flavor taxonomy - `GET /api/moods` — mood modes (the humane front door: Today, Wonder, People Helping, Solutions, Light Only, Grounded) - `GET /api/category-counts` — article counts per topic/flavor - `GET /api/feed?topic=&flavor=&limit=&offset=` — ranked, filtered articles - `GET /api/brief?date=&limit=` — a daily brief (latest if no date) - `GET /api/brief-dates` — available brief dates - `GET /api/source-preview?url=&classify=` — read-only scored sample of a feed (vet before adding) - `GET /api/candidates?status=` — staged source candidates (read-only; curation is CLI-only for now) - `GET /docs` — interactive OpenAPI documentation The ingestion CLI stays pure-stdlib; only the `web` extra pulls in FastAPI/uvicorn, so the two halves can be deployed and upgraded independently. ### Frontend The site is a SvelteKit static SPA in `frontend/` (calm editorial design, mood-mode navigation, the daily brief as a hero, browsable lanes, inline Calm Filters, PWA manifest). It consumes the JSON API above, so the website and a future companion app share one contract. Build it once and FastAPI serves the output: ```bash cd frontend && npm install && npm run build # -> frontend/build cd .. && python3 -m goodnews serve # serves frontend/build at / ``` If `frontend/build` is absent, the server falls back to the legacy single-page harness in `goodnews/static/`. The Docker image builds the frontend automatically (multi-stage), so deployment is just `docker build`. ## Calm Filters Personal, device-local controls so a reader can stay informed without subjects they'd rather not see right now. Preferences live in the browser (localStorage), are sent to the read endpoints as a `prefs` JSON query param, and are applied identically to the feed, the brief, and the category counts so the numbers always match what's shown. The canonical shape (`goodnews/filters.py`): ```json { "include_topics": [], "include_flavors": [], "mute_topics": [], "mute_flavors": [], "avoid_terms": ["election", "stock market"], "pauses": [{"kind": "topic", "value": "health", "until": "2026-06-02T00:00:00Z"}] } ``` The site surfaces a humane ladder rather than a settings panel of dread: - **Not today** → pause that article's topic for 24h. - **Less like this** → ease off that flavor for ~3 days. - **Always hide …** → a standing mute (undoable in the Calm filters panel). Avoid-terms match whole words/phrases (case- and punctuation-insensitive, no substring surprises like "pan" matching "pandemic"). The brief is filtered *down* for MVP (no refill from outside the stored brief). No accounts; the same `prefs` object is the clean migration path to server-side, multi-user preferences later. ## Deployment The database is never baked into the image — the API and the ingestion CLI share one SQLite file via a mounted volume. Run ingestion (`poll`, `classify`, `build-brief`) on a schedule against the same file. ```bash docker build -t goodnews . docker run -p 8000:8000 -v /srv/goodnews/data:/data goodnews ``` `GOODNEWS_DB` controls the database path (defaults to `data/goodnews.sqlite3`). Put a reverse proxy (Caddy/nginx) in front for TLS once a domain is attached. ## Scheduling A single idempotent command runs the whole pipeline and is safe to invoke as often as you like — it only polls sources that are *due* (per each source's `poll_interval_minutes`), only classifies articles the model hasn't seen, and rebuilds the current day's brief: ```bash python3 -m goodnews cycle # poll due -> classify -> dedup -> brief -> review flags python3 -m goodnews cycle --force # poll every active source regardless of interval python3 -m goodnews cycle --no-classify # skip the LLM step (e.g. model box offline) ``` A systemd timer runs it every 15 minutes. Unit files live in `deploy/`: ```bash sudo install -d /etc/goodnews sudo install -m 644 deploy/goodnews.env.example /etc/goodnews/goodnews.env # then edit sudo install -m 644 deploy/goodnews.service deploy/goodnews.timer /etc/systemd/system/ sudo systemctl daemon-reload sudo systemctl enable --now goodnews.timer systemctl list-timers goodnews.timer # when it next runs journalctl -u goodnews.service -f # watch cycle output ``` `/etc/goodnews/goodnews.env` supplies `GOODNEWS_LLM_BASE_URL`, `GOODNEWS_LLM_MODEL`, and `GOODNEWS_DB` to the scheduled run. The timer uses `Persistent=true`, so a run missed while the machine was off is caught up on the next boot. ## Next Steps Done so far: RSS/Atom ingestion with exact + semantic dedup, heuristic + local-LLM classification with topic/flavor tagging, the daily brief, the FastAPI web/API layer and site, scheduled `cycle` via systemd, a pytest suite, and device-local Calm Filters. Still ahead: 1. **Supervised source pipeline** — preview + staging are done: `suggest-source` previews a feed and stages it in the `source_candidates` table (status suggested/quarantined/rejected/promoted); `promote-candidate` copies it into `sources` (inactive by default — active on approval); promotion is never automatic. Advisory health is done too: `review-sources` (also run at the end of `cycle`) flags stale, failing, low-acceptance, duplicate-heavy, or doom-skewed feeds for human review — it never deactivates anything. Still ahead: an authenticated POST surface so the website can accept public suggestions once accounts exist. 2. **Learned "Less like this" weighting** — replace the interim flavor-pause with real preference down-ranking. 3. **Corpus rebalancing** — add calm/feelgood sources (currently science-heavy). 4. **Retention/pruning** — soft-delete + time-window indexes as the corpus grows toward ~10k articles (don't rush; not yet needed). 5. **Go-public hardening** — TLS via a reverse proxy, then a domain. ## Local Model Configuration The `classify` command expects an OpenAI-compatible local chat-completions server. You can pass settings directly: ```bash python3 -m goodnews classify --base-url http://127.0.0.1:1234/v1 --model gpt-oss --limit 10 ``` Or use environment variables: ```bash export GOODNEWS_LLM_BASE_URL=http://127.0.0.1:1234/v1 export GOODNEWS_LLM_MODEL=gpt-oss python3 -m goodnews classify --limit 10 ``` `classify` rewrites the current score/reason row for selected candidates. `rescore` can restore the fast heuristic scores.