upbeatBytes/README.md

# goodNews

Local-first constructive news ingestion prototype.

The first milestone is intentionally small: collect public RSS/Atom metadata, dedupe it, store short source-provided snippets, and attach early reason-coded heuristic scores. It does not store full article bodies.

## Commands

From this directory:

```bash
python3 -m goodnews init-db
python3 -m goodnews import-sources
python3 -m goodnews poll --limit 3
python3 -m goodnews rescore
python3 -m goodnews check-llm --base-url http://127.0.0.1:1234/v1 --model gpt-oss
python3 -m goodnews classify --limit 10 --base-url http://127.0.0.1:1234/v1 --model gpt-oss
python3 -m goodnews dedup --base-url http://127.0.0.1:1234/v1
python3 -m goodnews check-feeds
python3 -m goodnews preview-source https://example.com/feed/ --classify
python3 -m goodnews suggest-source https://example.com/feed/ --name "Example" --classify
python3 -m goodnews list-candidates
python3 -m goodnews promote-candidate 1        # copies into sources (inactive by default)
python3 -m goodnews reject-candidate 1
python3 -m goodnews review-sources             # advisory health flags (never deactivates)
python3 -m goodnews build-brief --date 2026-05-27 --replace
python3 -m goodnews show-brief
python3 -m goodnews list-recent --limit 10
python3 -m goodnews list-recent --accepted-only --limit 10
python3 -m goodnews list-category --topic animals --flavor discovery
python3 -m goodnews list-category --topic environment --flavor solution
python3 -m goodnews source-report
python3 -m goodnews list-runs
```

The SQLite database lives at:

```txt
data/goodnews.sqlite3
```

Sources live at:

```txt
config/sources.toml
```

## Categories

When classified by the local model, each article is tagged with one **topic**
and one **flavor**, allowing browsable category feeds (e.g. "feel-good animals",
"environment solutions") via `list-category`:

- **Topics:** science, environment, health, community, culture, animals
- **Flavors:** breakthrough, discovery, solution, feelgood, perspective

The allowed values live in `goodnews/taxonomy.py`. The accept/reject gate is kept
deliberately broad ("not dreary"); ranking and category filters do the curation.

## Deduplication

Two layers:

- **Exact**: a URL hash UNIQUE constraint drops the literal same link at ingest.
- **Semantic**: `dedup` embeds each article's title+snippet with the local
  embedding model, clusters near-identical stories within a few-day window
  (cosine similarity), and marks all but the highest-ranked in each cluster as
  `duplicate_of` the representative. Feed and brief queries hide duplicates, so
  the same story carried by several outlets appears once. This runs as part of
  `cycle`, so the scheduler keeps the corpus deduped automatically.

## Stored Article Data

For each article, the database stores:

- source
- canonical URL
- title
- short RSS/Atom description or summary
- author, if present
- published timestamp, if present
- image URL, if present
- language, if present
- hashes used for dedupe
- heuristic scores and reason codes

## Web / API

The optional `web` extra adds a FastAPI service and a small static site that
consumes it. The same JSON API backs both the website and any future companion
app; its auto-generated OpenAPI docs at `/docs` are the shared contract.

```bash
pip install -e '.[web]'          # or: .venv/bin/pip install -e '.[web]'
python3 -m goodnews serve                  # http://127.0.0.1:8000
python3 -m goodnews serve --host 0.0.0.0   # expose on the network
```

Endpoints:

- `GET /` — the static site (daily five + topic/flavor browsing)
- `GET /healthz` — liveness + scored-article count
- `GET /api/categories` — the topic/flavor taxonomy
- `GET /api/moods` — mood modes (the humane front door: Today, Wonder, People Helping, Solutions, Light Only, Grounded)
- `GET /api/category-counts` — article counts per topic/flavor
- `GET /api/feed?topic=&flavor=&limit=&offset=` — ranked, filtered articles
- `GET /api/brief?date=&limit=` — a daily brief (latest if no date)
- `GET /api/brief-dates` — available brief dates
- `GET /api/source-preview?url=&classify=` — read-only scored sample of a feed (vet before adding)
- `GET /api/candidates?status=` — staged source candidates (read-only; curation is CLI-only for now)
- `GET /docs` — interactive OpenAPI documentation

The ingestion CLI stays pure-stdlib; only the `web` extra pulls in FastAPI/uvicorn,
so the two halves can be deployed and upgraded independently.

### Frontend

The site is a SvelteKit static SPA in `frontend/` (calm editorial design, mood-mode
navigation, the daily brief as a hero, browsable lanes, inline Calm Filters, PWA
manifest). It consumes the JSON API above, so the website and a future companion
app share one contract. Build it once and FastAPI serves the output:

```bash
cd frontend && npm install && npm run build   # -> frontend/build
cd .. && python3 -m goodnews serve             # serves frontend/build at /
```

If `frontend/build` is absent, the server falls back to the legacy single-page
harness in `goodnews/static/`. The Docker image builds the frontend automatically
(multi-stage), so deployment is just `docker build`.

## Calm Filters

Personal, device-local controls so a reader can stay informed without subjects
they'd rather not see right now. Preferences live in the browser (localStorage),
are sent to the read endpoints as a `prefs` JSON query param, and are applied
identically to the feed, the brief, and the category counts so the numbers always
match what's shown. The canonical shape (`goodnews/filters.py`):

```json
{
  "include_topics": [], "include_flavors": [],
  "mute_topics": [], "mute_flavors": [],
  "avoid_terms": ["election", "stock market"],
  "pauses": [{"kind": "topic", "value": "health", "until": "2026-06-02T00:00:00Z"}]
}
```

The site surfaces a humane ladder rather than a settings panel of dread:

- **Not today** → pause that article's topic for 24h.
- **Less like this** → ease off that flavor for ~3 days.
- **Always hide …** → a standing mute (undoable in the Calm filters panel).

Avoid-terms match whole words/phrases (case- and punctuation-insensitive, no
substring surprises like "pan" matching "pandemic"). The brief is filtered *down*
for MVP (no refill from outside the stored brief). No accounts; the same `prefs`
object is the clean migration path to server-side, multi-user preferences later.

## Deployment

The database is never baked into the image — the API and the ingestion CLI share
one SQLite file via a mounted volume. Run ingestion (`poll`, `classify`,
`build-brief`) on a schedule against the same file.

```bash
docker build -t goodnews .
docker run -p 8000:8000 -v /srv/goodnews/data:/data goodnews
```

`GOODNEWS_DB` controls the database path (defaults to `data/goodnews.sqlite3`).
Put a reverse proxy (Caddy/nginx) in front for TLS once a domain is attached.

## Scheduling

A single idempotent command runs the whole pipeline and is safe to invoke as
often as you like — it only polls sources that are *due* (per each source's
`poll_interval_minutes`), only classifies articles the model hasn't seen, and
rebuilds the current day's brief:

```bash
python3 -m goodnews cycle                 # poll due -> classify -> dedup -> brief -> review flags
python3 -m goodnews cycle --force         # poll every active source regardless of interval
python3 -m goodnews cycle --no-classify   # skip the LLM step (e.g. model box offline)
```

A systemd timer runs it every 15 minutes. Unit files live in `deploy/`:

```bash
sudo install -d /etc/goodnews
sudo install -m 644 deploy/goodnews.env.example /etc/goodnews/goodnews.env  # then edit
sudo install -m 644 deploy/goodnews.service deploy/goodnews.timer /etc/systemd/system/
sudo systemctl daemon-reload
sudo systemctl enable --now goodnews.timer

systemctl list-timers goodnews.timer          # when it next runs
journalctl -u goodnews.service -f             # watch cycle output
```

`/etc/goodnews/goodnews.env` supplies `GOODNEWS_LLM_BASE_URL`, `GOODNEWS_LLM_MODEL`,
and `GOODNEWS_DB` to the scheduled run. The timer uses `Persistent=true`, so a
run missed while the machine was off is caught up on the next boot.

## Next Steps

Done so far: RSS/Atom ingestion with exact + semantic dedup, heuristic + local-LLM
classification with topic/flavor tagging, the daily brief, the FastAPI web/API layer
and site, scheduled `cycle` via systemd, a pytest suite, and device-local Calm Filters.

Still ahead:

1. **Supervised source pipeline** — preview + staging are done: `suggest-source`
   previews a feed and stages it in the `source_candidates` table (status
   suggested/quarantined/rejected/promoted); `promote-candidate` copies it into
   `sources` (inactive by default — active on approval); promotion is never
   automatic. Advisory health is done too: `review-sources` (also run at the end
   of `cycle`) flags stale, failing, low-acceptance, duplicate-heavy, or
   doom-skewed feeds for human review — it never deactivates anything. Still
   ahead: an authenticated POST surface so the website can accept public
   suggestions once accounts exist.
2. **Learned "Less like this" weighting** — replace the interim flavor-pause with
   real preference down-ranking.
3. **Corpus rebalancing** — add calm/feelgood sources (currently science-heavy).
4. **Retention/pruning** — soft-delete + time-window indexes as the corpus grows
   toward ~10k articles (don't rush; not yet needed).
5. **Go-public hardening** — TLS via a reverse proxy, then a domain.

## Local Model Configuration

The `classify` command expects an OpenAI-compatible local chat-completions server.

You can pass settings directly:

```bash
python3 -m goodnews classify --base-url http://127.0.0.1:1234/v1 --model gpt-oss --limit 10
```

Or use environment variables:

```bash
export GOODNEWS_LLM_BASE_URL=http://127.0.0.1:1234/v1
export GOODNEWS_LLM_MODEL=gpt-oss
python3 -m goodnews classify --limit 10
```

`classify` rewrites the current score/reason row for selected candidates. `rescore` can restore the fast heuristic scores.